QUERYING, EXPLORING AND MINING THE EXTENDED DOCUMENT

by

Nikolaos Sarkas

A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy

Graduate Department of Computer Science
University of Toronto

Copyright © 2011 by Nikolaos Sarkas
Collections of digital documents can be used to support information and knowledge discovery applications of
great diversity and richness. Roughly, the functionality built on top of text corpora can be classified into querying, exploration and mining.
Querying: The archetypal mode of interaction with a document collection is through querying. In many applica-
tion scenarios the user has a well defined need for a piece of information. Querying a document collection
involves expressing this information need and satisfying it by locating and ranking a handful of hopefully
relevant documents. For instance, in Web search users express their information need in the form of a
keyword query and as a response receive a ranked list of Web pages deemed to contain the information
requested.
Exploration: An information need is not always precise. Pushing this observation to the extreme, it can be
completely absent and, hence, speculative access to the document collection is required. In such scenar-
ios querying is by definition an ineffective mode of interaction. Exploratory applications compensate by
providing views to the document collection at different levels of granularity and facilitating the progressive
refinement and crystallization of the loosely-defined information need. As an example, consider a hierar-
chical clustering of the documents [34]. This organization allows a user to navigate the cluster hierarchy, at each iteration refining or broadening the scope of his view as feedback about the thematic structure of the
collection is received.
Mining: Document collections can also support a wide range of information extraction, pattern and knowledge
discovery tasks, usually domain specific. We use the term mining in its broadest sense1 to refer to these applications. As such, an example of a mining application is Information Extraction [10, 5, 58], whose aim is to extract useful, structured information from documents, in the form of entities and their relationships mentioned in text.
The boundaries between these different modes of functionality are fuzzy. Querying and exploration complement
each other, while both can be improved and supported by utilizing the output of text mining tasks.
1.2 The Extended Document, Dynamic and Integrated Collections
Applications originally adopted the view that documents are by definition equal to their textual component. It was
not uncommon for even the textual structure to be ignored, in which case a document was viewed as the multi-set
of the words comprising it. This approach is far from unreasonable and has been the basis for significant progress
on querying, exploring and mining textual data [34, 106]. However, it has become incompatible with the nature,
complexity and quantity of textual content generated on-line at a torrential pace.
The evolution of the Web into an interactive medium that promotes and encourages active user engagement
and co-operation has ignited a huge increase in the amount of available textual data: massive volumes are continuously posted on-line in the context of blogs2, micro-blogs3 and social networks4, customer feedback portals5,
etc. Such documents are naturally associated with a wealth of information in addition to their textual component,
while advances in Information Extraction and Natural Language Processing technology help highlight additional
document characteristics.
Example 1.1. As a first illuminating example consider the Blogosphere. Millions of bloggers author on a daily
basis in excess of 2.5 million blog posts, where they comment upon a wide variety of private matters, but also
products, people and public on-going events. Powerful Information Extraction machinery [10, 5, 58] can expose
entities mentioned in the text and their relationships.
The collective “discussion” on the Blogosphere takes place in a semi-anonymous manner, as bloggers typically reveal their demographic profile (age, gender, occupation, location). Novel tools also allow us to infer
additional information about bloggers, such as their standing among their peers [96, 109]. Hence, blog posts can
be associated with the demographic information of their authors.
Finally, blog posts also link to each other, providing us with knowledge about relationships between posts or
1. Broader than the scope of data mining, which is usually exclusively associated with pattern discovery.
2. www.blogspot.com
3. www.twitter.com
4. www.facebook.com
5. www.epinions.com
their authors and allowing us to understand how information is diffused in this massive, global network.
Example 1.2. On-line forums such as customer feedback portals offer unique opportunities for individuals to
engage with sellers or other customers and provide their comments and experiences. These interactions are
typically summarized by the assignment of a numerical or “star” rating to a product or the quality of a service.
Numerous such applications exist, like Amazon’s customer feedback and Epinions6. But even if ratings are not
explicitly provided, sentiment analysis tools [145, 120, 119] can identify with a high degree of confidence the
governing sentiment (negative, neutral or positive) expressed in a piece of text, which in turn can be translated
into a numerical rating.
Example 1.3. Users of on-line, collaborative tagging systems add to their personal collection documents such as
Web pages, blog posts, scientific publications, etc., and associate with each of them a short sequence of keywords
widely known as tags. Each tag sequence, referred to as an assignment, is a concise and accurate summary of the
relevant resource’s content according to the user’s opinion. Given the overlap among the individual collections,
documents accumulate a large number of assignments, each one of them posted by a different individual.
Example 1.4. Micro-blogging services and on-line social networks offer a remarkably similar model of interaction – users post short text messages or URLs which are pushed to their friends or followers. The diversity of non-textual information associated with a piece of text one or two sentences long is impressive: the exact time of
authorship, links to other web pages or messages, author demographic information, the underlying social network
of the sender, sentiment, entities mentioned, even the precise location of the author when a mobile device was
used to publish it.
The wealth of non-textual information associated with content published on-line is redefining the way we
utilize and interact with documents. The value of such meta-data is not just a premise. Their use has already given birth to novel and improved applications. There is no need to conjure up esoteric examples of such advances. Consider how the hyperlink revolutionized Web search and mining. Its creative use has enabled improved
Web search quality [31], personalized Web search [84], the ability to identify Web communities [94], study the
diffusion of information [1], and much more (Chapter 2).
In a similar manner, documents published on-line and associated with tags, entities, sentiment, author infor-
mation, etc., are being queried, explored and mined in ways and accuracy not possible for vanilla documents.
Meta-data are not simply ancillary information to the document’s textual component, but an integral part of what
the document is and the information it carries. This renewed view of the document as a datum that does not begin
and end with its textual component, but is extended to include as equally valuable and important the meta-data
associated with it, gives rise to the concept of the extended document.
6. www.epinions.com
The same developments that gave birth to the extended document have also rendered obsolete the view of
document collections as immutable and isolated from other valuable data collections, textual or structured.
Example 1.5. At the time of writing, people around the world author on a daily basis in excess of 2.5 million blog posts (30/sec), 55 million micro-blog posts (tweets, 630/sec), and 900 million social status messages (10,400/sec). New content is generated in a never-ending, streaming fashion, continuously expanding and enriching the corresponding document collections. Additionally, the focus and characteristics of newly generated content are constantly evolving, reflecting and shaping events, public opinion and memes7.
Example 1.6. Data sources publicly available on-line exhibit great diversity. The Web is evolving from a col-
lection of web pages to a federation of document collections, each one with unique characteristics. Web pages,
blog posts, tweets, social status messages, wikis, forums, user reviews, etc., form conceptually distinct document
collections. However, they do not exist in isolation from each other: they co-exist, co-evolve and interact. Besides
textual data, structured data are becoming an integral part of the Web, whether they lie in public “deep web”
databases [23] or correspond to knowledge about entities and their relationships extracted from Web textual data.
As the above examples illustrate, document collections are far from static and isolated. They are “alive”;
growing and evolving; co-existing and interacting. They are dynamic and integrated. An impressive gamut of
novel applications for querying, exploring and mining textual data is enabled by utilizing the evolving nature of
document collections and leveraging the synergies between different data collections.
One could even argue that utilizing these emerging characteristics of documents and document collections
is not simply an opportunity, but rather the only possible way forward. The amount and diversity of textual
data available on-line is immense and the pace at which it is generated is accelerating, as more people become
comfortable with participating in on-line activities. The content's focus and characteristics change as fast as the
world around us and novel applications continuously join the existing ecosystem and generate new types of textual
content and meta-data.
It is not clear that textual data at this scale, complexity and rate of change can be interacted with and made
sense of by ignoring everything but their textual component. Would searching for information within billions
of Web pages be possible without the help of link analysis? Similarly, can search and exploration of billions of
blogs, tweets, social status messages, user reviews, etc., be possible without utilizing notions of time, social or demographic proximity, and distillations such as entities and sentiment? Is effective search for information and
knowledge possible if we only focus on a thin slice of the available data at a time?
The growth of on-line user participation, with all its economic and social benefits, is silently supported by
techniques and applications that intermediate and connect users producing and consuming information. Without
7. A postulated unit of cultural ideas, symbols or practices [49].
effective solutions for searching, exploring and mining textual data, this continuing growth is far from granted.
For all the above reasons it is becoming essential to unlock the information contained in meta-data; to develop
techniques that handle rapidly changing corpora; to push the boundaries of the data sets utilized by our applica-
tions.
Nevertheless, developing successful applications is not a straightforward enterprise, as we are presented with formidable challenges. We can identify two main sources of complexity: one is conceptual and one is computational.
First, utilizing information about documents in addition to their textual component introduces increased con-
ceptual complexity. Querying, exploring and mining extended documents demands that we reason about the
relations and dependencies among text and meta-data. The same is true of integrated document collections. The
relations among diverse types of data are rarely straightforward and deterministic. The challenge is compounded
by the great variance in the data attributed to the fact that content is increasingly the result of social user activity
rather than professional authorship.
Second, the size of available document collections is immense, while new textual content is being generated at
a staggering pace. Yet, applications such as querying and exploration demand interactive response times. Mining tasks also need to be highly efficient and, where possible, amenable to incremental computation, since the dynamic
nature of document collections requires their frequent application.
1.3 Contributions and Outline
Motivated by these developments, the present thesis contributes to a growing body of work that seeks to develop
novel and improved techniques for querying, exploring and mining textual data by exploiting the extended nature
of documents and document collections. We present five solutions for interactively querying and exploring document collections. Although we do not present a stand-alone mining application, the solutions presented will be frequently supported by data mining tasks.
The contributed techniques are “distinct” from each other in the following sense. Note that by the very notion
of the extended document, a technique cannot be general enough to be applicable to any type of document. Each
approach needs to focus on creatively and efficiently utilizing different types of meta-data, available in different
application domains. This is also true for integrated document collections. Different approaches might be needed
based on the nature of co-existing data collections and the task at hand.
This observation holds true for the solutions to be presented: each one is applicable to a different class of
extended documents. Nevertheless, the fact that they are not, and cannot be, completely general, does not imply
that their scope is limited. For instance, we discuss (Chapter 3) an approach for interactively exploring a collection
of extended documents whose associated meta-data are entities extracted from text and categorical attributes.
While the presence of entities and categorical attributes is a prerequisite for applying the technique, it is applicable
to all extended document collections with these characteristics.
In addition, our contributions share two core traits. First, they employ to the extent possible principled probabilistic reasoning. Previously, we identified the complexity of on-line content as a major challenge in developing successful solutions. The use of probabilistic reasoning, rather than ad-hoc heuristics, in explicitly or implicitly capturing the user behavior and activities giving rise to this complexity, allows us to effectively address it. Second, they were designed with the explicit goal of being highly efficient and applicable to vast, real-world data
collections. We invest considerable effort in validating this claim, by presenting extensive experimental results on
real sets of data.
More specifically, with respect to exploring document collections, we make the following contributions. Part
of this work also appears in [130, 9, 131].
Extended Documents with Entity Mention and Categorical Meta-data [130, 9]
The vast amount and great diversity of user generated content published on-line necessitates novel paradigms for
its understanding and exploration. To this end, we introduce an efficient methodology for discovering strong entity associations within all the slices (categorical meta-data value restrictions) of a document collection. Since related
documents mention approximately the same group of core entities (people, locations, companies, products, etc.),
the entity associations discovered can be used to expose underlying themes within each slice of the document
collection. This and other relevant information can be interactively presented on demand as one “drills-down” to
a slice of interest, or compared for different slices.
We devise efficient algorithms capable of addressing two flavors of the core problem: algorithm THR-ENT for
computing all sufficiently strong entity associations (Threshold Variation) and algorithm TOP-ENT for computing
the top-k strongest entity associations (Top-k Variation), within each slice of the extended document collection.
Algorithm THR-ENT eliminates from consideration provably weak associations, while TOP-ENT supports early
termination. A unique characteristic of algorithms THR-ENT and TOP-ENT is their ability to accommodate any
plausible alternative for quantifying the degree of association between entities. This trait enables the use of complex but robust statistical correlation measures that are most appropriate for text mining tasks. Finally, the
application of the algorithms and their variations on all the slices of the collection is supported by an efficient and
nimble infrastructure that exploits slice overlap.
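To make the notion of a statistical correlation measure concrete, the following sketch scores entity pairs within a single slice by the chi-square statistic of their co-occurrence. It is an illustration only: the entity names and documents are invented, and the actual THR-ENT and TOP-ENT algorithms add the pruning and slice-sharing machinery described above instead of this naive enumeration.

```python
from collections import Counter
from itertools import combinations

def association_scores(slice_docs):
    """Chi-square co-occurrence score for every entity pair in a slice.

    slice_docs: list of sets, each the entities mentioned in one document."""
    n = len(slice_docs)
    entity_count = Counter()
    pair_count = Counter()
    for entities in slice_docs:
        entity_count.update(entities)
        pair_count.update(combinations(sorted(entities), 2))
    scores = {}
    for (a, b), observed in pair_count.items():
        expected = entity_count[a] * entity_count[b] / n  # under independence
        scores[(a, b)] = (observed - expected) ** 2 / expected
    return scores

# Invented toy slice: four documents with their extracted entities.
docs = [{"Obama", "Toronto"}, {"Obama", "Toronto"}, {"Obama"}, {"Toronto", "Leafs"}]
scores = association_scores(docs)
```

Any other measure (log-likelihood ratio, mutual information) could replace the chi-square formula in the loop, which is precisely the flexibility the algorithms are designed to preserve.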
The efficiency and applicability of the proposed techniques to massive document collections is demonstrated by means of a thorough experimental evaluation that employs synthetically generated and real world data, comprised of millions of documents. The utility in this context of statistical correlation measures, as was previously
observed in the literature, is verified through the public implementation of the proposed functionality [9] and a
smaller scale demonstration on real data.
Extended Documents with User Rating/Sentiment Meta-data [131]
In the context of on-line social activity, it is common for users to express their views and opinions on products,
services, etc., in reviews both formal (retailer or specialized Web sites) and informal (blogs, forums). These
interactions are typically summarized with a numerical or “star” rating, which can be explicitly provided by the
user or detected by means of a sentiment analysis tool. Hence, user reviews can be viewed as extended documents
associated with a numerical rating from a small domain.
The information contained in reviews is invaluable to other users, but plain keyword search is an ineffective approach for interacting with, exploring and digesting it. We enable richer interaction by supporting the progressive refinement of a query result in a data-driven manner, through the suggestion of expansions of the query with additional keywords. The suggested expansions allow one to interactively explore the reviews, by focusing on a
particularly interesting subset of the original result set. Examples of such refinements are reviews discussing a
certain feature of the product described in the original query, or subsets of reviews with ratings that are high or low on average.
To offer this novel functionality at interactive speeds, we introduce a framework that is computationally efficient and nimble in terms of storage requirements. For a user query, the “interestingness” of each candidate expansion is quantified by means of a fairly general scoring function, instantiated to produce expansions with the desired characteristics. The top-k highest scoring ones are presented. Our solution utilizes the principle of Maximum Entropy to efficiently estimate the score of a candidate expansion, instead of performing a demanding exact computation. It is further improved by utilizing Convex Optimization principles that allow us to exploit the pruning opportunities offered by the natural top-k formulation of the problem.
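The estimation idea can be illustrated at a toy scale. Suppose that, instead of storing the full joint distribution over query presence q, candidate word presence w and positive rating r, we store only the three pairwise marginals. Iterative proportional fitting then recovers the unique maximum-entropy joint consistent with them, from which quantities such as the share of positive ratings among reviews matching both q and w can be estimated. This sketches the principle only; the variables and marginals below are invented, and the thesis's framework operates at a very different scale and with the convex-optimization pruning described above.

```python
def maxent_joint(p_qw, p_wr, p_qr, iters=500):
    """Maximum-entropy joint over three binary variables (q, w, r),
    fitted to the three given 2x2 pairwise marginals by iterative
    proportional fitting (IPF)."""
    p = [[[1 / 8] * 2 for _ in range(2)] for _ in range(2)]  # uniform start
    for _ in range(iters):
        for q in range(2):                       # match the (q, w) marginal
            for w in range(2):
                m = p[q][w][0] + p[q][w][1]
                for r in range(2):
                    p[q][w][r] *= p_qw[q][w] / m
        for w in range(2):                       # match the (w, r) marginal
            for r in range(2):
                m = p[0][w][r] + p[1][w][r]
                for q in range(2):
                    p[q][w][r] *= p_wr[w][r] / m
        for q in range(2):                       # match the (q, r) marginal
            for r in range(2):
                m = p[q][0][r] + p[q][1][r]
                for w in range(2):
                    p[q][w][r] *= p_qr[q][r] / m
    return p

# Invented, mutually consistent pairwise marginals.
p_qw = [[0.20, 0.10], [0.15, 0.55]]
p_wr = [[0.15, 0.20], [0.10, 0.55]]
p_qr = [[0.15, 0.15], [0.10, 0.60]]
p = maxent_joint(p_qw, p_wr, p_qr)
# Estimated share of positive ratings among reviews matching q and w:
est = p[1][1][1] / (p[1][1][0] + p[1][1][1])
```

The fitted joint matches every stored marginal while committing to nothing else, which is exactly the behavior one wants when the exact joint counts are too expensive to store or compute per query.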
Performance is evaluated using both synthetic data and large real data sets comprised of blog posts. Our
results indicate that the improvement gains are such that the application of our solution in an interactive scenario
is feasible.
With respect to querying document collections, we make the following contributions. Part of this work also
appears in [132, 133, 134].
Extended Documents with Tag Meta-data [132]
Through a number of on-line applications, users have the ability to associate documents with sequences of descriptive keywords, referred to as tags. Each tag sequence is a concentrated description of the document's content
according to a user. For each document we can have a significant number of such individual opinions. This information is extremely valuable for a number of applications, including search. Intuitively, Information Retrieval algorithms attempt to automatically identify the keywords that are relevant to a document. With tags, we have a large number of human users collaborating to generate this information.
Previous work on searching socially annotated document collections presented ad-hoc approaches to utilizing
tags. Such approaches fail to leverage our growing understanding of the structure and dynamics of the tagging process. Efficiency issues were also left unaddressed. Instead, we introduce a principled probabilistic methodology for determining the relevance of a query to a document's tags. Our solution utilizes interpolated n-grams to model the tag sequences associated with each document. The use of interpolated n-gram models exposes significant and highly informative tag co-occurrence patterns (correlations) present in the user assignments. The training and incremental maintenance of the interpolated n-gram models is performed by means of a novel constrained optimization framework that employs powerful numerical optimization techniques and exploits the unique properties of both the function to be optimized and the parameter domain.
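As a simplified sketch of the modeling idea (with invented tag data, and without the constrained-optimization training of the interpolation weights, which are fixed by hand here), a document's tag assignments can back a bigram model whose per-term probabilities interpolate bigram and unigram estimates:

```python
from collections import Counter

def train_tag_model(assignments):
    """Unigram and bigram counts over a document's tag assignments."""
    uni, bi = Counter(), Counter()
    for tags in assignments:
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
    return uni, bi

def query_likelihood(query, uni, bi, lam=0.7, eps=1e-6):
    """Interpolated bigram/unigram likelihood of a query under the model."""
    total = sum(uni.values())
    score, prev = 1.0, None
    for t in query:
        p_uni = uni[t] / total if total else 0.0
        if prev is None:
            p = p_uni                             # no history for the first term
        else:
            p_bi = bi[(prev, t)] / uni[prev] if uni[prev] else 0.0
            p = lam * p_bi + (1 - lam) * p_uni    # the interpolation step
        score *= p + eps                          # eps smooths unseen terms
        prev = t
    return score

# Invented assignments for one document, posted by three different users.
assignments = [["python", "tutorial"], ["python", "tutorial", "web"], ["web", "design"]]
uni, bi = train_tag_model(assignments)
```

The co-occurrence pattern python → tutorial makes the query ["python", "tutorial"] far more likely under this document's model than ["python", "design"], even though all of these tags occur individually.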
We orchestrated a large scale experimental evaluation involving a large crawl of the most popular social
annotation application and tens of independent human judges. Our results demonstrate significant improvement
in both retrieval precision (+30%) and optimization efficiency (4×).
Dynamic Collection of Extended Documents with Partially-Ordered Categorical Meta-data [133]
The skyline of an extended document collection annotated with categorical attributes can be used to single out documents that possess a uniquely interesting combination of attribute values that no other document can match,
by having more preferable values in all its attributes. The subsequent uses of the skyline documents are many and
depend on the application. In addition, the skyline is a valuable concept not only for static document collections,
but also, perhaps more so, for streaming, dynamic document collections.
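The dominance relation under partially ordered categorical attributes can be sketched as follows (a naive quadratic skyline over invented attributes; the contribution of this work is the incremental maintenance machinery, not this baseline). Each attribute carries a strict "better-than" relation, assumed already transitively closed, and values absent from it are mutually incomparable:

```python
def dominates(a, b, prefer):
    """True iff document a dominates b: on every attribute a's value is
    equal or strictly preferred, and strictly preferred at least once."""
    strict = False
    for attr, better_than in prefer.items():
        va, vb = a[attr], b[attr]
        if va == vb:
            continue
        if vb in better_than.get(va, set()):
            strict = True        # a is strictly better on this attribute
        else:
            return False         # b is better, or the values are incomparable
    return strict

def skyline(docs, prefer):
    """Naive O(n^2) skyline: documents dominated by no other document."""
    return [d for d in docs
            if not any(dominates(o, d, prefer) for o in docs if o is not d)]

# Invented partial orders: "unknown" is incomparable to the other licenses.
prefer = {"quality": {"high": {"medium", "low"}, "medium": {"low"}},
          "license": {"open": {"restricted"}}}
docs = [{"quality": "high", "license": "open"},
        {"quality": "medium", "license": "open"},
        {"quality": "low", "license": "unknown"}]
sky = skyline(docs, prefer)
```

The first document dominates the second, but the third survives: its "unknown" license is incomparable to "open", so no document beats it on every attribute — exactly the complication that partial orders introduce into the comparison of two documents.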
There are two main challenges in computing the skyline of a dynamic collection. First, recomputing the skyline from scratch each time a new document arrives or an old one expires is wasteful. Instead, the skyline needs to be incrementally maintained after such updates. Second, extended document meta-data are typically categorical in nature and, hence, preferences can define a partial ordering of their domain. This flexibility complicates the “comparison” of two documents' values.
In order to address these challenges, we adopt a generic skyline maintenance framework and design its two
building blocks: A grid-based data structure for indexing the most recent documents of the collection and a data
structure based on geometric arrangements to index the current skyline documents.
Our experimental evaluation reveals one order of magnitude performance improvement over the adaptation of
a competing alternative that was originally developed for static data, as well as improved scalability in terms of
the number of extended document attributes.
Document Collection Integrated with Structured Data Sources [134]
Search engines are increasingly utilizing diverse sources of information, in addition to their Web page index, in order to better serve user queries. One such source is structured data collections, which we can abstract as
relational tables. Results in response to keyword queries related to products (“50 inch samsung lcd”), movie
showtimes (“inception toronto”), airlines schedules (“morning toronto to cuba flights”), etc., are augmented by
presenting information directly retrieved from structured data tables (e.g., entries from tables TVs, Movies and
Flights respectively for the above examples).
In order to integrate structured data tables into Web search in this manner, each query needs to be analyzed so that highly plausible mappings of the query to a table and its attributes are identified. The challenges in performing this analysis are twofold. First, a search engine can maintain thousands of structured data tables that are implicitly
and indirectly queried by users that demand response times in the order of milliseconds. The Web query analysis
overhead needs to be minuscule. Second, the intent behind free-text Web queries is rarely clear and unambiguous.
Determining which, if any, tables and their attributes are relevant to a query – with high precision and recall – is
hard.
To address these challenges we introduce a fast and scalable mechanism for obtaining all possible mappings
of a query to the structured data tables. A probabilistic scoring mechanism subsequently estimates the likelihood
of the user intent captured by each mapping and deploys a dynamic, query-specific threshold to eliminate highly
unlikely mappings. The probability estimates utilized in these computations are mined off-line from the structured
and query log data. The techniques are completely unsupervised, obviating the need for costly manual labeling
effort.
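A drastically simplified sketch of the mapping step follows; the table names, vocabularies and probabilities are all invented, and the real system enumerates all mappings, scores them with a probabilistic model mined from query-log and table statistics, and applies a dynamic, query-specific threshold rather than the fixed one used here.

```python
def map_query(tokens, tables, threshold=1e-4):
    """Attach each query token to the first attribute whose value
    vocabulary contains it; a table mapping survives only if every token
    is covered and the product of mined token probabilities clears the
    threshold."""
    results = []
    for name, attrs in tables.items():
        mapping, score = {}, 1.0
        for tok in tokens:
            hit = next(((a, p[tok]) for a, p in attrs.items() if tok in p), None)
            if hit is None:
                break                    # token unexplained by this table
            mapping[tok] = hit[0]
            score *= hit[1]
        else:
            if score >= threshold:
                results.append((name, mapping, score))
    return sorted(results, key=lambda r: -r[2])

# Invented off-line statistics: P(token | attribute) for two tables.
tables = {
    "TVs": {"brand": {"samsung": 0.3, "sony": 0.2},
            "size": {"50": 0.1, "inch": 0.2},
            "type": {"lcd": 0.4, "led": 0.3}},
    "Movies": {"title": {"inception": 0.05}, "city": {"toronto": 0.1}},
}
mappings = map_query(["samsung", "lcd"], tables)
```

Here the query maps to the TVs table only; "samsung" has no interpretation in Movies, so that candidate is abandoned after a single token — the kind of early pruning that keeps the per-query overhead small.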
The effectiveness and efficiency of our techniques are evaluated using real world queries and data. Overall, our approach offers high precision (90%) with good recall (40%), while imposing on average sub-millisecond overhead even in the presence of 1000 structured data tables.
Outline
The remainder of the thesis is organized as follows. In Chapter 2 we review work on querying/exploring/mining
extended documents, dynamic and integrated document collections, and explore in more detail the background
on which the functionality subsequently presented lies. Part I of the thesis follows (Chapters 3 and 4) where
we introduce our techniques for exploring document collections. Subsequently, Part II (Chapters 5, 6 and 7)
introduces our querying techniques. We offer our closing thoughts in Chapter 8.
Chapter 2
Related Work
The techniques that we subsequently present are part of a growing body of work that seeks to improve querying,
exploration and mining of extended documents, dynamic and integrated document collections. In this Chap-
ter we review part of this activity with the goal of further clarifying the context in which this thesis’ technical
contributions belong.
Most of the existing work on processing extended documents is typically built around a single type of meta-
data. Hence, we choose to structure our presentation of it around the type of meta-data used: links (Section 2.1),
categorical attributes (Section 2.2), tags (Section 2.3), extracted entities (Section 2.4) and user feedback (Section
2.5). In addition, we review work on dynamic (Section 2.6) and integrated (Section 2.7) document collections.
2.1 Link Analysis and Social Networks
One of the first pieces of non-textual information associated with documents to be successfully utilized are hyperlinks, or simply links. Links between documents can be explicit, such as hyperlinks connecting two Web pages, or implicit when document sources are connected instead, as in the case of content authored in the context of on-line
social networks.
The first attempts to leverage the elaborate link structure among documents were in the context of Web search
and ranking. The PageRank [31] and HITS [90] algorithms developed in this context are among the most success-
ful. The premise behind utilizing intra-document links is that the relevance of a keyword query to a document’s
textual component cannot be the sole guide in determining the top results to be presented in response to the query,
since a significant number of documents can be highly relevant. Instead, the authority or importance of a Web page should also be an important factor. As Kleinberg [90] argues, a link from a web page p to a web page q confers, in some measure, authority on q. Hence, links can be used in order to automatically infer a document's
importance.
The PageRank algorithm maps the document collection into a graph: documents comprise the graph's nodes, while directed edges among nodes are created whenever a link exists between the corresponding documents. The intuition behind the algorithm is that highly authoritative documents should be heavily linked by other highly authoritative documents. This recursive definition of authority conferred by means of links enables an elegant approach for computing the global authority of documents. The document graph is viewed as a Markov chain and authority propagation is equivalent to a traversal of this Markov chain: frequently visited chain states (authoritative documents) are linked by other frequently visited states (other authoritative documents). Hence, the steady state distribution of the chain can be used as a proxy of the corresponding document authority.
The profound success of the PageRank algorithm led to the development of numerous variants. While PageRank computes the global importance score of a document, Topic-Sensitive PageRank [71] computes a document's importance within a subset of documents determined to be relevant to a particular topic. On the other hand, Personalized PageRank [84] computes user-specific importance scores. Both algorithms utilize the Markov chain formulation of the PageRank computation problem. They bias the steady state distribution of the chain towards the desired subset of states (documents) by adding a biased “teleportation” component to the Markov chain: at each step, with probability a, a state outlink is used to continue the chain traversal, while with probability 1 − a, a jump towards the biased states occurs.
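Both the uniform and the biased variants fit in a few lines of power iteration. The sketch below (graph and parameter values invented) follows an outlink with probability alpha and teleports with probability 1 − alpha, either uniformly (plain PageRank) or according to a supplied bias vector (the Topic-Sensitive and Personalized variants):

```python
def pagerank(links, alpha=0.85, bias=None, iters=50):
    """Power iteration for (optionally personalized) PageRank: with
    probability alpha follow an outlink, with probability 1 - alpha
    teleport according to `bias` (uniform when bias is None)."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    tele = bias or {u: 1 / n for u in nodes}
    rank = {u: 1 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - alpha) * tele.get(u, 0.0) for u in nodes}
        for u in nodes:
            outs = links.get(u) or nodes      # dangling node: spread everywhere
            share = alpha * rank[u] / len(outs)
            for v in outs:
                nxt[v] += share
        rank = nxt
    return rank

# Invented three-page graph: b is linked by both a and c.
links = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
r = pagerank(links)
```

Passing, say, bias={"a": 1.0} concentrates the teleportation on page a, inflating the scores of a and its neighborhood — the mechanism shared by the topic-sensitive and personalized variants.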
Besides inferring document importance, intra-document links have been used to combat web spam. Given
that search engines utilize link structure in order to assess the importance of web pages, spammers have an
incentive to create clusters of pages densely linking each other. Due to the mechanics underlying the PageRank
and HITS algorithms, undesirable pages in such clusters would receive high importance scores. [68] presents
an algorithm facilitating the discovery of such link spam. Their algorithm uses the importance scores computed
by two applications of the PageRank algorithm. First, PageRank is normally applied. Then, it is reapplied but
this time using a random teleportation component biased towards a seed of pages that are known not to be spam.
Since good pages rarely link to spam pages, the second application of PageRank should be biased towards good
pages. Pages that exhibit considerable divergence among the scores produced by the two PageRank applications
are candidates for being spam.
Web community detection is a text mining application made possible by link analysis. By utilizing the links
between user-generated content (blogs, tweets, etc.), communities of authors can be discovered. [94] is a pioneering attempt to organize Web pages into communities. Based on the observation that communities should include a core, i.e., a full bipartite graph of hub pages linking to authoritative pages, an efficient algorithm is developed
that hunts for such cores. In a sense, community detection can be viewed as a highly focused document clustering
technique, capable of identifying focused clusters of documents based on their linking patterns instead of their
CHAPTER 2. RELATED WORK 12
textual similarity. Among the possible applications enabled by community detection is exploration, since the
identified communities can serve as an entry point for exploring the Web.
Identification of "influential" individuals is an important task with numerous applications, including querying.
For instance, Mathioudakis and Koudas [109] identify query-dependent influential blogs, referred to as starters.
Given a query, e.g., "politics", their goal is to identify blogs whose posts relevant to "politics" receive a significantly higher number of inlinks than the number of outlinks that they contain. Hence, starters are influential blogs with respect to the particular topic specified by the query. The efficient identification of
query-specific starter blogs is supported by sampling the blog graph inferred by the posts relevant to the query
and their links. The proposed technique allows the approximate computation of the most influential blogs without
processing the entire graph.
Sociology-inspired approaches for studying “influence” require the explicit knowledge of an underlying social
network. For instance, [89] studies the problem of identifying the influential individuals that should be targeted
in order to maximize the spread of an “idea” in the social network. However, information about the underlying
social network is not always available. In such scenarios, linking patterns between user-generated content (e.g.,
blog posts) can help us make inferences about how influential individuals or web sites are among their peers.
Similarly, [1] attempt to reconstruct the underlying "influence network" between blogs. While information spreads through this implied network, evidence of "infection" is not always available. For example, while a blogger might first view a YouTube video on a different blog and re-publish it on his own, he will not always provide a link to the blog where he first encountered this information. In order to identify such implicit information pathways between blogs, [1] train a number of classifiers that are used to determine whether an information link exists between two blogs.
2.2 Faceted Search
A widely deployed technique that utilizes hierarchical categorical meta-data in order to support enhanced querying and navigation of textual data collections is Faceted Search, proposed by Pollit [124] and subsequently by Yee et al. [154]. The faceted search model, which adds exploratory capabilities to the plain keyword search model, assumes extended documents whose meta-data are orthogonal attributes with hierarchical domains, referred to as facets. In their presence, the result of a vanilla keyword query can be refined by "drilling down" the facet hierarchies. This interactive process places and gradually tightens constraints on the attributes, allowing one to identify and concentrate on a fraction of the documents that satisfy a keyword query. This slice of the original result set possesses properties that are considered interesting, expressed as constraints on the document meta-data attributes.
Perhaps the most well known application of faceted search is on-line stores such as Amazon: a user keyword query on the store's product database can potentially retrieve thousands of product pages (extended documents) that can be refined by Product Type (e.g., Book, Movie), Price, etc. In this context, attributes Product Type and Price are two independent facets, whose domains are organized in hierarchies (e.g., Product Type ⇒ Book ⇒ Fiction), that can be traversed in order to refine the original query result.
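The drill-down mechanics can be sketched as follows; the catalog, facet names and hierarchy paths below are hypothetical:

```python
# Each document stores, per facet, its full path down the facet hierarchy.
products = [
    {"title": "Dune", "Product Type": ("Book", "Fiction"), "Price": ("0-20",)},
    {"title": "Linear Algebra", "Product Type": ("Book", "Textbook"), "Price": ("20-50",)},
    {"title": "Alien", "Product Type": ("Movie",), "Price": ("0-20",)},
]

def refine(docs, facet, prefix):
    """Keep documents whose path for `facet` starts with `prefix`."""
    prefix = tuple(prefix)
    return [d for d in docs if d[facet][:len(prefix)] == prefix]

books = refine(products, "Product Type", ["Book"])            # drill down once
fiction = refine(books, "Product Type", ["Book", "Fiction"])  # tighten further
```

Each successive `refine` call corresponds to one drill-down step, progressively narrowing the keyword query's result set.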
The facet domains and their hierarchical organization can be meta-data naturally associated with the documents, can be set manually by an expert, or can be automatically extracted from the document collection at indexing time. The automatic extraction of multiple orthogonal facets and their associated hierarchies from the document collection is the focus of work conducted by Dakka et al. [47, 46].
In [47] a limited, supervised approach for extracting facets and associating documents with them is presented. The approach assumes a training set of well-defined facets and words associated with them (e.g., facet "Animals" and words "cat", "dog"). Each document is processed and a set of descriptive keywords (nouns) is extracted from it. These can be considered as the document "meta-data" or descriptive attributes. The keywords are also extended with their WordNet hypernyms (more general terms) and a classifier is used to assign each extracted keyword to one of the training set facets. This first stage identifies the facets from the well-specified training set that are relevant to the particular document collection and associates the document keywords (meta-data) with facets. The keywords of each facet are subsequently organized into a hierarchy by utilizing keyword co-occurrence patterns in order to identify subsumption/equivalency relations among the keywords.
The unsupervised technique presented in [46] builds upon the ideas presented in [47], although it is primarily applicable to documents rich in named entities (e.g., "Hillary Clinton", "Microsoft", etc.) such as news articles. A sophisticated array of algorithms is applied to identify named entities within documents. Then, Wikipedia data is used to identify the broader "category" associated with a named entity, e.g., "Hillary Clinton" is a "Person" and a "Politician". A few of these categories are automatically singled out and organized in independent hierarchies.
A few attempts have been made to extend and further improve the basic faceted search model. [21] argues that while facets are invaluable for refining keyword queries, existing solutions provide too little information to help users select an appropriate refinement. The only information provided, besides a small sample of possible refinements per facet, is the number of documents contained in the refined result set. For example, if a user queries an on-line store with keyword "digital camera", the results can potentially be refined by Manufacturer, such as "Nikon" or "Canon". The only guide provided for selecting "Nikon" or "Canon" is that 20 cameras are made by "Nikon" and 30 by "Canon". Instead, considerably more, and more useful, information can be presented by considering additional document meta-data, such as the average rating of Nikon and Canon products, their average price and so on. Besides introducing this and other improvements, [21] discusses their efficient implementation using only
regular, unmodified and freely available (e.g., Lucene) document retrieval systems (inverted indices).
Evidently, one of the limitations of the faceted search paradigm is its reliance on a handful of well-defined facet hierarchies. This tends to render the approach inapplicable to domains that exhibit high content variance, since no meaningful, universal facets exist. In Chapter 4 we present a technique similar in spirit to faceted search which instead utilizes more accessible meta-data, such as sentiment extracted from text.
2.3 Social Annotation
Social annotation, also referred to as collaborative tagging, has been steadily building momentum since its recent inception and has now reached the critical mass required for driving exciting new applications. For instance, in December 2008 del.icio.us (www.delicious.com), one of the many applications using collaborative tagging, reported 5.3 million users who had annotated a total of 180 million URLs. Given that the user base and content of such sites has been observed to double every few months, these numbers only loosely approximate the immense popularity and size of systems that employ social annotation.
Users of an on-line, collaborative tagging system add to their personal collection a number of documents (e.g., Web pages, scientific publications, etc.) and associate with each of them a short sequence of keywords, widely known as tags. Each tag sequence, referred to as an assignment, is a concise and accurate summary of the relevant document's content according to the user's opinion. The premise of annotating documents in that manner is the subsequent use of tags in order to facilitate the searching and navigation of one's personal collection.
As an example, del.icio.us users add to their collection the URLs of interesting Web pages and annotate them
with tags so that they can subsequently search for them easily. Users can discover and add URLs to their collection
by browsing the web, searching in del.icio.us or browsing the collections of other users. Given the considerable
overlap among the individual collections, documents accumulate as meta-data a large number of assignments,
each one of them posted by a different individual.
Research on collaborative tagging has mainly followed two directions. One direction focuses on utilizing this
newly-found wealth of information in the form of tag meta-data to enhance existing applications and develop new
ones. The second attempts to understand, analyze and model the various aspects of the social annotation process.
With respect to searching for and ranking extended documents with tag meta-data, Hotho et al. [76] propose a static, query-independent ranking of the documents (as well as of users and tags) based on an adaptation of the PageRank algorithm [31]. Users, documents and tags are first organized in a tripartite graph, whose hyper-edges are links of the form (user, document, tag). This graph is then collapsed into a normal undirected graph
whose nodes represent indiscriminately users, documents and tags, while edge weights count co-occurrences of
the entities in the hyper-edges of the original graph. The PageRank algorithm is then applied, producing a total
ordering involving all three types of entities.
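The collapse of the tripartite graph can be sketched as a co-occurrence count over hyper-edges; the assignments below are invented for illustration:

```python
from collections import Counter
from itertools import combinations

# (user, document, tag) hyper-edges
assignments = [
    ("u1", "d1", "python"),
    ("u1", "d2", "python"),
    ("u2", "d1", "python"),
    ("u2", "d1", "search"),
]

# Undirected edge weights: how often two entities co-occur in a hyper-edge.
weights = Counter()
for edge in assignments:
    for x, y in combinations(sorted(edge), 2):
        weights[(x, y)] += 1
```

Running PageRank over this weighted graph then yields the single ordering mixing users, documents and tags described above.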
Yanbe et al. [152] use tags to improve document ranking by combining into the ranking function many features
extracted from tags. Such features include the similarity of the query to tags, as well as the number of user
assignments and their recency. Intuitively, the last two features are indicative of the document’s quality and
freshness, suggesting an authority measure similar to PageRank, but less susceptible to spam (Section 2.1).
Bao et al. [17] adapt a machine learning approach to ranking in the case where the extended documents are
Web Pages. A support vector machine is used to “learn” the ranking function [85] which weighs five different
features of the pages: the tf/idf similarity of the page and the query [106], two different similarity measures
between the query and the tags, its PageRank, and the PageRank adaptation that was presented in [76].
Amer-Yahia et al. [7] propose a solution for ranking documents efficiently, under the constraint that only the
assignments posted by users in social network neighborhoods are to be used, thus personalizing query results
based on the social network of the user submitting the query. The proposed technique is a general framework that
can be used in combination with a variety of ranking functions monotonic in the frequency of tags appearing as
query terms.
In Chapter 5 we present our own approach to using tags in order to improve the ranking of extended documents. Unlike the research presented above, it utilizes tags in a principled probabilistic manner, motivated by our growing understanding of the social annotation process ([64, 70, 33, 73], presented below), while being able to cope with the scale and rapid growth of social annotation systems.
Besides ranking, researchers have also looked into other interesting problems related to collaborative tagging,
including facilitating the exploration of the socially annotated document collection. Li et al. [97] organize the tags
in a loose hierarchical structure in order to facilitate browsing and exploration of tags and documents. Ramage et
al. [125] extended the Latent Dirichlet Allocation technique [27] for extracting topics from textual data to include
tags and used it to derive an improved clustering of web pages into thematic categories.
Another body of work is concerned with the analysis and modeling of collaborative tagging systems [64, 70,
33, 73]. [64, 70] observed that the distribution of tags assigned to a document converges rapidly to a remarkably
stable heavy-tailed distribution. [64] concentrates on identifying the user behavior that leads to this phenomenon,
while [70] attempts to mathematically model it.
[33], on the other hand, explores and models the co-occurrence patterns of tags across documents. They found that, given a tag, the tags that tend to be used in conjunction with it follow a stable heavy-tailed distribution: the more specific the tag in question is, the flatter the tail of the distribution. The authors attribute this to a hierarchical organization of the tags co-occurring with the one singled out. This observation points to the utility
of tags for exploring relevant collections of extended documents.
Heyman et al. [73] investigate whether the additional information provided by social annotations has the potential to improve Web search, and reach mostly positive conclusions. They considered the del.icio.us social annotation system, where users tag Web pages. Among their important positive conclusions is that the Web pages present in del.icio.us are interesting, fresh and actively updated, and that judges in their user study found tags to be both relevant and objective. Their two most significant negative observations are that the pages present in del.icio.us cover a tiny fraction of the Web overall, and that tags also tend to be present in the Web page text.
2.4 Information and Entity Extraction
The use of Information Extraction (IE) technology can expose vast amounts of information embedded within
the textual component of documents and enable profoundly more sophisticated ways of interacting with such
collections of extended documents.
IE systems are (usually sophisticated) algorithms capable of identifying structured information in unstructured textual data. Recent tutorials [10, 5, 58] provide an excellent overview of the technology supporting information extraction, both rule-based and machine learning-based. Typically, an instantiation of a particular IE system is able to identify a single, well-specified relation. For example, an IE algorithm can be tuned and trained to scan news articles and retrieve concert information. Such information can be described by a tuple with schema 〈Band Name, City, Venue, Date〉. A specialized IE application is Named Entity Extraction, whose goal is to identify mentions of named entities such as people's names, corporation names, products, etc.
Entities extracted from text enable the Entity Search querying model [42]. The model is motivated by the
observation that many queries issued on a document collection, such as the Web, do not seek a specific document
for browsing, but rather information which can be present in multiple documents or scattered across documents.
Example queries are “amazon customer service phone” or “university of toronto professors”. The latter query
searches for entities (professors) mentioned in Web pages containing terms “university of toronto”. Hence, in the
Entity Search model, queries are comprised of both desired entity types and keywords. As a response, entities
matching the query and supporting Web pages are presented. In [42], entity ranking is based on a probabilistic framework, where documents matching the keyword portion of the query and containing a requested entity provide
“evidence” in its favor.
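The evidence-aggregation intuition can be sketched with a toy counting model; the documents and entities below are invented, and [42] uses a full probabilistic framework rather than raw counts:

```python
from collections import Counter

# Each document carries its text and the entities extracted from it.
docs = [
    {"text": "university of toronto faculty page", "entities": ["Alice"]},
    {"text": "university of toronto research group", "entities": ["Alice", "Bob"]},
    {"text": "cooking blog", "entities": ["Carol"]},
]

def entity_search(keywords, docs):
    """Score each entity by the number of keyword-matching documents
    that mention it; each matching document is one piece of "evidence"."""
    scores = Counter()
    for d in docs:
        if all(k in d["text"] for k in keywords):
            scores.update(d["entities"])
    return scores.most_common()

ranked = entity_search(["university", "toronto"], docs)
```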
Similarly, [8] utilizes both entities extracted from text and known associations between them to offer sophis-
ticated querying functionality. For instance a user can query for “pet-friendly hotels in a lively city”. The struc-
tured information supporting this query are tuples〈Hotel, City〉. But abstract attributes such as “pet-friendly”
and “lively” cannot be exposed using IE techniques and instead standard information retrieval techniques need to
be applied to retrieve documents related to these terms. All these diverse pieces of structured and unstructured information are efficiently stitched together to provide the top 〈Hotel, City〉 pairs matching the desired attributes, as
well as documents supporting this claim.
Besides querying, entities have also been used for mining textual data. [142, 139] utilize entities extracted from
documents in order to identify “topics” present in a document collection. Topics are detected by first computing
pairs of correlated entities and then further grouping these entity pairs into clusters. The assumption underlying
this approach is that such groups of entities correspond to an underlying event. In Chapter 3 we build upon
the principles presented in [142, 139] to provide interactive exploratory functionality for extended document
collections which, in addition to entities, are associated with categorical meta-data attributes.
Unlike other extended document meta-data, the generation of entities and their relationships is not a byproduct
of user activity, but an extremely demanding computational task. Each instantiation of an IE algorithm is designed
to extract a particular relation from text. Normally, numerous such IE “black-boxes” need to be applied in order
to extract useful and diverse information from documents. The development and debugging of such IE black-
boxes, the combination and reconciliation of their outputs, as well as their efficient application on large document
collections is a herculean task. Many systems are currently under development whose goal is to manage and
optimize this entire process.
In this spirit, [136] propose an approach for developing a complex IE program using the Datalog language to
combine small and highly specific IE “predicates”. The advantages of this approach are twofold. First, developing
highly specific and targeted IE algorithms allows for more effective and focused development. Second, stitching together these small and specific IE operators into a larger and more complex program using Datalog rules
enables their optimized execution: since the high level IE program requires the application of many smaller IE
operators and joining of their output into more complex relations, the order in which they are applied is crucial for
performance. Hence, [136] focuses on the enumeration of possible execution strategies and the use of statistics
and cost models to identify the most efficient one.
While the focus of [136] is on efficiency, the quality of the generated data is also of paramount importance.
IE algorithms offer limited precision and recall. Motivated by this observation, Ipeirotis et al. [82] develop an
optimization framework for the efficient extraction of a single relation from a text collection, at the desired recall
level (% of tuples recovered). They identify that there exist four possible execution strategies for applying an IE
algorithm on a document collection, with unique recall/execution cost characteristics. A sophisticated cost/recall
estimation process is developed and used in the context of an adaptive query execution framework: the system initiates the execution with what a priori appears to be the optimal plan and, as more accurate statistics about the corpus are gathered during execution, adaptively switches to a more efficient plan at the desired recall level.
2.5 Sentiment and User Feedback
In the context of active user participation in on-line activities, it is common for users to express, either explicitly
or implicitly, their views and opinions on products, events, etc. For example, on-line forums such as customer
feedback portals offer unique opportunities for individuals to engage with sellers or other customers and provide
their comments and experiences. These interactions are typically summarized by the assignment of a numerical
or "star" rating to a product or the quality of a service. Numerous such applications exist, like Amazon's customer feedback and Epinions. Any major online retailer engages, in one way or another, with consumer-generated feedback.
But even if ratings are not explicitly provided, sentiment analysis tools [145, 120, 119] can identify with a high
degree of confidence the governing sentiment (negative, neutral or positive) expressed in a piece of text, which in
turn can be translated into a numerical rating. This capability enables the extraction of ratings from less formal
reviews, typically encountered in blogs. Extending this observation, such tools can be employed to identify the
dominant sentiment not only towards products but also events and news stories. Virtually any document can be
“extended” by associating it with a rating signifying the author’s attitude towards some event.
Conversely, users not only voluntarily express their opinion, but also actively seek such information made
available by fellow users. Studying feedback provided by complete strangers on-line is an integral part of Internet users' decision-making process [119]. There is "safety in numbers" that neither a limited number of personal
acquaintances nor professional critics can provide. Hence, facilitating access to this information is a pressing
need.
Aspect summarization enables a better understanding of user reviews. Aggregating user ratings or sentiment to provide an overall rating towards a product, service or event is informative, but masks finer-granularity patterns. Most rated items have certain aspects that users like and other aspects that they dislike. A high-rated restaurant can have "great food", but "slow service". The goal of aspect summarization is to expose these facets together with the aggregate sentiment towards them. Evidently, aspect summarization is closely related to faceted search and automatic facet discovery (Section 2.2).
Aspect summarization techniques typically employ machine learning tools applied off-line to a collection of reviews. As an example, Yue et al. [103] extract rated aspect summaries from a collection of extended documents with ratings as their meta-data. The authors use a variation of Probabilistic Latent Semantic Analysis (PLSA) [75] to identify aspects as PLSA topics comprised of "modifier" phrases, such as "good service". Overall document ratings are then used to associate each aspect with its individual rating.
In Chapter 4 we present a tool that is similar in spirit to the rated aspect summarization technique of [103].
However, "rated aspects" are computed on-the-fly, rather than off-line, in response to an ad-hoc keyword query and
suggested as possible expansions of the original query.
Another valuable task is singling out high-quality reviews. [4] consider many document (review) features, such as text quality, and document meta-data, such as hyperlinks and user ratings (votes on the review's helpfulness), in order to develop a classifier capable of identifying high-quality documents. A related approach is adopted in [61]. There, besides linguistic features, document sentiment is used to predict its perceived usefulness. Intuitively, the subjectivity of the document affects its quality.
Evidently, user reviews posted on-line are relevant not only for fellow users seeking information, but also for manufacturers, analysts, etc. Reviews and their associated ratings or sentiment can both influence and predict sales. [61] develop a model to estimate the effect of a review on subsequent sales, taking into account its rating, its quality and other features. [155] build an autoregressive model which uses sales at times t−1, t−2, . . ., as well as review ratings, quality and sentiment at these time instances, to predict sales at time t. Similarly, [13] develop a linear regression model to predict movie box office results using micro-blogging messages from Twitter and their associated sentiment.
2.6 Dynamic Document Collections
Textual content is being generated on-line at a torrential pace (Example 1.5). There is a constant flow of new
documents being added to document repositories. Furthermore, this “stream” of information is non-stationary. Its
focus and characteristics shift over time, in a gradual or abrupt manner.
The mining task of Topic Detection and Tracking [153] seeks to extract time evolving topics from a dynamic
document collection. For instance, the topics mined from a blog post collection can correspond to active discussion about public events. The literature in this area is vast. Some, such as [147], adopt a probabilistic approach and identify a topic as an evolving word distribution, in the spirit of PLSA [75]. Others, such as [16],
perform at each time period (e.g., single day) a clustering of correlated words or entities. Topics are identified
as persistent word clusters, allowing for changes in cluster structure over time. Recent work utilizes, in addition
to the documents’ textual component and timestamps, information about the underlying social network in the
context of which content is generated [158].
The appearance of a new theme or topic in a dynamic document collection is rarely gradual. Typically, the
emergence of a new topic is signaled by a burst in activity. Kleinberg [91] detects escalating bursts for each word using a Hidden Markov Model whose hidden states correspond to levels of "burstiness" for that word. Bursts are detected by estimating the most likely hidden state sequence from observed word arrival rates. [60] takes this idea further: after periods of bursty activity for each word are identified, "bursty words" are grouped together into "bursty events".
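A drastically simplified two-state version of this idea can be sketched as a Viterbi pass over per-period word counts; the Poisson rates and switching cost below are arbitrary illustrative choices, not Kleinberg's parameterization:

```python
import math

def detect_bursts(counts, r0=2.0, r1=8.0, gamma=1.0):
    """counts: events observed per time period. Returns the most likely
    hidden-state sequence (0 = base rate r0, 1 = bursty rate r1); each
    state switch incurs cost gamma."""
    def cost(state, c):
        # negative log Poisson likelihood, dropping the state-independent c! term
        r = r1 if state else r0
        return r - c * math.log(r)

    best = [0.0, gamma]  # cost of starting in state 0 / state 1
    back = []
    for c in counts:
        step = []
        for s in (0, 1):
            stay, switch = best[s], best[1 - s] + gamma
            prev = s if stay <= switch else 1 - s
            step.append((min(stay, switch) + cost(s, c), prev))
        back.append((step[0][1], step[1][1]))
        best = [step[0][0], step[1][0]]

    # trace back the optimal state sequence
    s = 0 if best[0] <= best[1] else 1
    states = []
    for prevs in reversed(back):
        states.append(s)
        s = prevs[s]
    return states[::-1]

states = detect_bursts([1, 2, 1, 9, 10, 8, 2, 1])
```

On this toy count sequence the middle, high-count periods come out in the bursty state, while the low-count periods at either end remain in the base state.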
The techniques presented above are off-line and retrospective, i.e., they are applied to a static snapshot of
a dynamic document collection. A more relevant research challenge in modern real-time text streams, such as
micro-blogging services, is on-line detection of events, ideally as early in their genesis as possible.
For instance, [113] introduces a technique for detecting and tracking topics in an on-line fashion. The text stream is modeled using a mixture of drifting word distributions corresponding to topics. Statistical techniques are employed to dynamically select the "optimal" number of components in the mixture, i.e., the number of topics currently present in the most recent documents of the dynamic collection. The need to introduce a new component is indicative of the emergence of a new topic in the text stream.
A different on-line "event detection" problem, unrelated to topic mining, is studied in [110]. Content published on-line attracts varying degrees of attention from users. Depending on the context, the attention received is materialized as links to the document, positive votes, etc., that are spread over time. [110] performs on-line and early detection of documents that receive an unexpectedly high degree of attention, based on the attention-gathering intensity that has been normal for the document's source. The decision to pro-actively declare a document as "attention gathering" is made using sequential statistical tests.
The applications described above mine dynamic document collections for patterns, topics and events. However, the dynamic nature of modern document collections is also redefining search, since on many occasions document recency needs to be incorporated into ranking decisions. [52] discuss "breaking-news queries", for which document recency should be a defining feature of result relevance. Intuitively, breaking-news queries are queries about on-going events. The system presented in [52] detects breaking-news queries using a classifier and responds by using a ranking model tailored to such queries.
In Chapter 6 we introduce a novel technique for querying dynamic collections of extended documents associated with categorical attributes. The solution recognizes the extra information content carried by newly generated documents and responds by identifying, among the most recent documents, a subset that is "uniquely interesting" according to user preferences.
2.7 Integrated Document Collections
Information available on-line is extremely heterogeneous. The complications of this heterogeneity are typically resolved by developing techniques for querying, exploring and mining a coherent subset of the data at a time. For instance, there exist specialized vertical search engines for querying Web pages, news articles, blog posts, micro-blog posts, product reviews, etc. In addition to textual information, a wealth of structured information is also available on-line, residing in Web-accessible "deep web" databases [23]. Again, vertical search engines provide access to a subset of this information at a time: products (e.g., www.bing.com/shopping), flight information (e.g., www.kayak.com), etc.
Major search engines are responding to this fragmentation. They are evolving from glorified information retrieval algorithms operating on the corpus of Web pages into query-answering ecosystems acting as a single access point to all the information available on-line, textual or structured. Depending on the query, data collections alternative to the Web page index are used to serve information. Perhaps the simplest and most familiar example is a query about the weather, e.g., "toronto weather". The top result of such a query is actually the current weather in Toronto and a weather forecast, rather than a link to a Web page. This information is obviously retrieved from an ancillary data source, perhaps the database of a partner Web site. A large fraction of queries can greatly benefit from the integration of data from alternative sources into Web search results: queries about products (e.g., "4mp sony cameras"), reviews (e.g., "good toronto hotels"), flights (e.g., "toronto to chicago flights").
The central challenge in querying integrated document collections is determining the most relevant document
collections for each query. A widespread approach is to train classifiers and use them to route a Web query to the appropriate document or structured data collections. Recent work [11, 12] suggests using multiple features extracted from the query string (e.g., keywords such as "jobs", "reviews", etc.), query log data and click-through data (queries and clicks that reached a collection), and the document collections themselves (similarity of the query to a collection's text).
Of particular interest is the interaction of Web search with real-time, dynamic document collections such as news articles and micro-blog posts. Whether a query is "newsworthy" and would hence benefit from the inclusion of news articles and micro-blog posts generated in real time depends on the time at which it is issued. It would make sense to integrate fresh news in the results of a query for "toronto blue jays" around the time of a relevant sports event, but not otherwise. The approach adopted by Diaz et al. [51] searches for joint bursts of activity in queries and the news collection. It then speculatively presents relevant news articles in response to the "bursty" queries and determines whether a query is newsworthy or not based on how often the news articles are preferred over the regular search results.
Integrating structured data collections into Web search raises additional challenges. Evidently, a query classification approach [11, 12] could be used in order to determine whether a structured data source should be utilized to answer the query. However, such queries could greatly benefit from a deeper structural analysis, which can additionally identify which structured attributes are present in the query. For example, besides simply identifying that the query "4mp sony cameras" can be issued to a product database, we would like to identify in particular that
products of Type = Digital Camera and characteristics Resolution = 4 MP, Brand = Sony are requested. In Chapter 7 we further discuss the benefits of such an approach and present an effective and efficient solution for this task.
8 http://www.bing.com/shopping
9 www.kayak.com
CHAPTER 2. RELATED WORK 22
Besides querying, text mining applications can benefit from the use of multiple, integrated document collections. For instance, Topic Detection algorithms can be applied to news articles, blog posts or micro-blog post collections independently. Nevertheless, it would be preferable to identify the same topic in all three document collections. This would provide both richer context and improved topic detection quality. To this end, recent work
extends the PLSA approach [75] of detecting topics as word distributions, to topics spanning multiple static [157]
or dynamic collections [147, 146].
Another example that demonstrates the benefit of integrating document collections in text mining is offered
by [78]. In traditional document clustering [34] documents are viewed as bags of words. [78] suggests enhancing
this representation by utilizing the Wikipedia text corpus. In a sense, an extra transitive similarity link between
two documents is established through their similarity to the same Wikipedia documents, which implies a “topical”
connection.
Part I
Exploration
Chapter 3
Interactive Exploration of Extended
Document Collections
3.1 Introduction
In this Chapter we introduce techniques that enable the intuitive exploration of extended document collections.
This is accomplished by leveraging two distinct but complementary types of meta-data: mentions of interesting
and relevant entities extracted from the documents’ textual component and categorical document attributes.
Let us illustrate the proposed functionality by considering a collection of blog posts as a concrete example.
One can utilize an entity extractor and obtain entities of interest such as people, locations, products, companies,
etc., mentioned in the posts. Then, posts by bloggers discussing the “Dark Knight” movie also mention actor names such as “Heath Ledger” and “Christian Bale”. Posts discussing the Canadian “Listeriosis” outbreak mention the disease in conjunction with “Public Health Agency of Canada” and locations like “Canada” and “Toronto”. The key observation is that related posts, capturing the same story, mention approximately the same group of core entities. By identifying such groups of strongly associated entities, i.e., groups of entities that are
recurrently mentioned together, we implicitly detect the underlying event of which they are the main actors.
Besides identifying strong entity associations (and, hence, the underlying events) in the entire post collec-
tion, we would like to utilize the fact that bloggers typically reveal their demographic profile, such as their age,
gender, occupation and location. This information allows us to expose the stories capturing the attention of each
demographic segment, such as “people in the US” or “young males in the US”.
Then, relevant entity associations can be presented on demand as we “drill-down” to the demographic of in-
terest, or compared for different demographics. Once an entity-group of interest has been identified, the most
CHAPTER 3. INTERACTIVE EXPLORATION OF EXTENDED DOCUMENT COLLECTIONS 25
relevant or influential posts associated with it can be easily fetched and browsed. In this manner, a deeper un-
derstanding of the underlying event can be achieved. For a demonstration of this functionality and its utility see
[9].
To support such interactive browsing and exploration of the document collection, we need to pre-compute and materialize strongly associated entities for all attribute value combinations that can possibly be requested. We refer to the fraction of documents matching a certain attribute value restriction as a slice of the collection (e.g., a slice can be a restriction on demographic attributes in the case of blogs). Depending on the application, two variations of our core problem are of interest: computing all sufficiently strong entity associations (Threshold Variation) and computing the top-k strongest entity associations (Top-k Variation), for all the different slices of the extended document collection. The Top-k Variation is particularly interesting and highly useful. Given that associations are meant to be browsed, identifying the k most pronounced ones eliminates the need for a detection threshold that could lead to the computation of too few or too many associations.
Additionally, note that we used the notion of association among entities in a very generic sense. Depending on the application context, robust measures of statistical correlation or even simpler measures of set-overlap might be appropriate for quantifying the degree of association between entities. Given the wide applicability of the proposed functionality, all plausible measures should be supported. Such flexibility enables the use of complex measures such as the Likelihood Ratio statistical test [148], whose unique properties and behavior [54] render it ideal for exposing interesting and meaningful entity associations in user-generated content.
The relentless pace at which user-generated content is accumulated, combined with the need to analyze up-to-date data, necessitates highly efficient solutions for both the Threshold and the Top-k variations of the problem. In
order to address this challenge, we make the following contributions.
• We develop algorithm THR-ENT for addressing the Threshold variation of the problem. THR-ENT elimi-
nates from consideration provably weak associations.
• We develop algorithm TOP-ENT for addressing the Top-k variation of the problem. TOP-ENT supports
early termination as soon as it can guarantee that the top-k associations computed so far constitute the final
result.
• THR-ENT and TOP-ENT are designed with the explicit goal of supporting virtually any association measure,
no matter how complex.
• We identify and exploit computation sharing and optimization opportunities, as well as accuracy/efficiency
trade-offs offered by the overlap among slices.
• We demonstrate the efficiency and applicability of the proposed techniques using both synthetically gener-
ated and real world data, comprised of 1.4M blog posts pre-processed by a custom entity extractor.
Part of the work presented in this Chapter also appears in [130]. The rest of the Chapter is organized as follows: In Section 3.2 we discuss existing work. In Section 3.3 we formally define the problems we need to address. Section 3.4 presents our core algorithmic techniques and Section 3.5 presents the infrastructure required for their application. Further optimization opportunities and trade-offs are explored in Section 3.6. Section 3.7 discusses useful extensions of our techniques. In Section 3.8 we evaluate the performance of the proposed solutions and highlight the usefulness of the Likelihood Ratio test. We conclude in Section 3.9.
3.2 Comparison to Existing Work
The core problem of identifying strong entity associations is related to association rule mining and set-similarity
search. In association rule mining [6, 30], a collection of item-sets (extended documents) is mined in order to,
essentially, identify frequently co-occurring items (entities). In set-similarity search [44, 19, 150], a collection of
sets (entities) is probed by a query-set, and collection-sets with sufficiently high overlap are retrieved. However,
existing techniques from these two domains cannot satisfy the significantly richer requirements of our application.
First, the proposed techniques support virtually any measure of association between entities. Existing algorithms are tailored around a single measure, e.g., support/confidence [6], the X² test [30] or simple set-similarity measures [19, 150]. The flexibility offered by the solutions introduced enables the use of complex, non-linear but robust association measures like the Likelihood Ratio test. As demonstrated before [54] and advocated in Section 3.8, the unique and intuitive behavior of the Likelihood Ratio renders it ideal for use in our setting. To the best of our knowledge, no previous algorithm is general enough to support the Likelihood Ratio or any other plausible measure of association.
Second, the proposed techniques support the efficient computation of the k strongest entity associations. Such functionality is extremely powerful as it eliminates the need to set a sensitive detection threshold that could lead to the detection of too few or too many associations.
Third, we explore in depth and exploit computation sharing opportunities arising from the need to compute
entity associations for all the slices of the extended document collection. These traits further distinguish the
solutions introduced from previous work on association rule mining and set-similarity search.
In the context of Topic Detection and Tracking, [142, 139] utilize entities extracted from the documents in order to identify topics. Topics, for a single day, are detected by first computing pairs of entities found to be associated using the X² test and then further grouping these entity pairs into entity clusters. Our approach extends the ideas of [142, 139]. In addition to entities, we leverage categorical document attributes to identify
entity associations in all the slices of the extended document collection and, thus, enable deeper understanding of
the content and enhanced exploratory functionality. Furthermore, we focus on efficiency and applicability of the
proposed techniques to massive document collections, an issue left unaddressed in [142, 139].
3.3 Formal Problem Statement
Consider a collection D of n extended documents d1, . . . , dn whose meta-data are a set of entities and categorical attributes. Each extended document is associated with l attributes denoted with A1, . . . , Al. We denote with di(A1, . . . , Al) the meta-data attribute values annotating di. To ease the exposition of our ideas (and without sacrificing generality) we assume that the attribute domains Dom(A1), . . . , Dom(Al) are unordered. In general, the domains can be partially ordered, e.g., they can form a hierarchy.
Let A ∈ Powerset(A1, . . . , Al) be a subset of the l attributes and A× be the cartesian product of their corresponding domains. For example, if A = {A1, A2}, then A× = Dom(A1) × Dom(A2). Set A× is essentially comprised of all the value combinations of the attributes contained in A. Each element a of set A× defines a slice sD(a) of the extended document collection: the slice is comprised of the documents whose attribute values match those in a, i.e., sD(a) = {d ∈ D | a ⊆ d(A1, . . . , Al)}.
Example 3.1. Our extended document collection is comprised of blog posts associated with two attributes specifying the blogger’s demographic profile: (A)ge and (G)ender, with (A)ge = {young, old} and (G)ender = {male, female}. The two meta-data attributes and their domains define nine distinct slices of the collection: (young, male), (young, female), (old, male), (old, female), (young), (old), (male), (female) and (), where the last slice is essentially the entire post collection.
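As an illustration of the slice machinery, the following Python sketch enumerates all slices induced by a set of unordered attribute domains and materializes a slice sD(a); for the domains of Example 3.1 it produces exactly the nine slices listed. The helper names are illustrative, not part of the system described in this Chapter.

```python
from itertools import combinations, product

def all_slices(domains):
    # domains: attribute -> list of values, e.g. {"Age": ["young", "old"]}.
    # A slice is a dict of attribute-value restrictions; {} denotes the
    # unrestricted slice (), i.e., the entire collection.
    attrs = list(domains)
    slices = []
    for r in range(len(attrs) + 1):
        for subset in combinations(attrs, r):
            for values in product(*(domains[a] for a in subset)):
                slices.append(dict(zip(subset, values)))
    return slices

def slice_of(docs, restriction):
    # sD(a): documents whose attribute values match the restriction a.
    return [d for d in docs if all(d.get(a) == v for a, v in restriction.items())]
```

For two attributes with two values each, the sketch yields 1 + 4 + 4 = 9 slices, matching the enumeration of Example 3.1.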
The application of an entity extraction algorithm on D reveals the mentions of m distinct entities e1, . . . , em in the documents1. The set of entities mentioned in a document are part of its meta-data. It will be convenient to represent the available information with respect to the occurrences of entities in documents by means of an entity-document occurrence matrix Om×n.
Definition 3.1. The entity-document occurrence matrix Om×n is a binary matrix such that element oij = 1 only if entity ei is mentioned in extended document dj.
The document-list ei = {dj | dj mentions ei} of entity ei corresponds to the i-th row of matrix O. Respectively, the entity-list dj = {ei | ei mentioned in dj} of document dj corresponds to the j-th column of the matrix. The document-lists and entity-lists are essentially row-oriented and column-oriented representations of the sparse matrix O.
1 Certain classes of entity extraction algorithms attach a confidence value to each entity. For those algorithms, we assume that an appropriate threshold has been selected and only the entity matches with confidence exceeding the threshold are reported.
We denote with ci = |ei| the number of documents that mention entity ei and with cij = |ei ∩ ej| the number of documents where entities ei and ej are mentioned together. Lastly, we denote with esi the documents where entity ei occurs in the context of a specific slice s, i.e., esi = {dj ∈ s | dj mentions ei} = ei ∩ s.
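As an illustration, the document-lists and the counts ci and cij can be materialized with simple inverted lists; the sketch below uses hypothetical names and represents each list as a set of document identifiers.

```python
from collections import defaultdict

def build_document_lists(docs_entities):
    # docs_entities: dict doc_id -> set of entities mentioned in the document
    # (the entity-lists, i.e., the column-oriented view of O).
    # Returns the row-oriented view: entity -> set of doc ids (document-lists).
    doc_list = defaultdict(set)
    for dj, entities in docs_entities.items():
        for ei in entities:
            doc_list[ei].add(dj)
    return doc_list

def cooccurrence(doc_list, ei, ej):
    # c_ij = |e_i ∩ e_j|: number of documents mentioning both entities.
    return len(doc_list[ei] & doc_list[ej])
```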
At the core of the proposed solution lies the need to materialize strongly associated groups of entities for all the
slices of the document collection. In what follows, we concentrate on the computation of associated entity pairs (i.e., groups of two entities). The definitions and techniques introduced are extended for associations involving an
arbitrary number of entities in Section 3.7.
There exists a wide range of alternatives that can be applied to assess the degree of association between entities. Some of the most commonly used measures of association can be broadly classified into two categories: statistical correlation measures and set-similarity measures2.
Statistical measures of correlation treat entities as random variables and assess their association using a statistical hypothesis test [148]. The base hypothesis typically used is that two entities ei and ej occur in the relevant slice of the collection independently of one another, i.e., if pi (pj) is the probability that a document contains ei (ej) and pij is the probability that a document contains both entities, then pij = pi·pj. For significantly associated entities, co-occurring frequently in documents, we have that pij ≫ pi·pj. Statistical tests produce numerical values which can be mapped to the likelihood that, for the entities examined, the assumption of independent occurrence in documents does not hold. Higher values indicate a higher likelihood that the assumption is violated and therefore signify stronger association. Two of the most common statistical hypothesis tests are the X² test and the Likelihood Ratio test [148].
Consider two entities ei and ej and a collection comprised of n extended documents. We denote with N11 the number of documents actually containing both entities. In a similar manner, let N10 (N01) be the number of documents containing entity ei (ej) but not entity ej (ei) and N00 the number of documents containing neither entity. We also denote with E11, E10, E01, E00 the expected values of these quantities under the independence assumption for entities ei and ej. The values of both the observed and expected quantities can be easily expressed as a function of the size n of the underlying document collection, the occurrences ci, cj and co-occurrence cij of the two entities.
• X²: X(ei, ej) = Σ_{x∈{0,1}} Σ_{y∈{0,1}} (Nxy − Exy)² / Exy
• Likelihood Ratio: L(ei, ej) = Σ_{x∈{0,1}} Σ_{y∈{0,1}} 2 Nxy ln(Nxy / Exy)
2 Probability metrics are a third popular category [62]. Such measures are also supported by the techniques subsequently introduced.
Set-similarity measures of association treat entities as the sets of documents where they appear, and attempt
to quantify entity association as set overlap. Perhaps the most widely used set-similarity measure is the Jaccard
Coefficient, although other measures, like the Dice Coefficient, are also used [105].
• Jaccard Coefficient: J(ei, ej) = |ei ∩ ej| / |ei ∪ ej| = cij / (ci + cj − cij)
• Dice Coefficient: D(ei, ej) = 2 |ei ∩ ej| / (|ei| + |ej|) = 2 cij / (ci + cj)
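All of the above measures are functions of n, ci, cj and cij alone. The following Python sketch computes the four measures from these counts; the helper `contingency`, which derives the observed counts Nxy and expected counts Exy, is ours for illustration, and terms with Nxy = 0 are taken to contribute 0 to the Likelihood Ratio.

```python
import math

def contingency(n, ci, cj, cij):
    # Observed counts N_xy and expected counts E_xy under independence,
    # all derived from n, ci, cj and cij as described above.
    obs = {(1, 1): cij, (1, 0): ci - cij, (0, 1): cj - cij,
           (0, 0): n - ci - cj + cij}
    pi, pj = ci / n, cj / n
    exp = {(1, 1): n * pi * pj, (1, 0): n * pi * (1 - pj),
           (0, 1): n * (1 - pi) * pj, (0, 0): n * (1 - pi) * (1 - pj)}
    return obs, exp

def chi_square(n, ci, cj, cij):
    obs, exp = contingency(n, ci, cj, cij)
    return sum((obs[k] - exp[k]) ** 2 / exp[k] for k in obs)

def likelihood_ratio(n, ci, cj, cij):
    obs, exp = contingency(n, ci, cj, cij)
    # Terms with N_xy = 0 contribute 0 (limit of x ln x as x -> 0).
    return sum(2 * obs[k] * math.log(obs[k] / exp[k]) for k in obs if obs[k] > 0)

def jaccard(ci, cj, cij):
    return cij / (ci + cj - cij)

def dice(ci, cj, cij):
    return 2 * cij / (ci + cj)
```

When cij equals its expected value ci·cj/n, both statistical tests evaluate to 0, consistent with the independence assumption.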
By carefully inspecting the aforementioned, as well as other, measures for assessing the degree of association
of an entity pair, we observe a number of shared mathematical properties. These properties capture a series
of intuitive characteristics that one would expect from a measure of association and strength of co-occurrence
between two entities.
Definition 3.2. An association measure M is a real function M(ei, ej) = M(ci, cj, cij) of three variables ci = |ei| (occurrences of entity ei), cj = |ej| (occurrences of entity ej) and cij = |ei ∩ ej| (co-occurrence of entities ei and ej). The function has the following properties.
We formalize this intuition with the notion of surprise [137, 30, 53]. Let p(wi) be the probability of word wi appearing in a document of the collection and p(w1, . . . , wr) be the probability of words w1, . . . , wr co-occurring in a document1. If words w1, . . . , wr were unrelated and were used in documents independently of one another, we would expect that p(w1, . . . , wr) = p(w1) · · · p(wr). Therefore, we use a simple measure to quantify
by how much the observed word co-occurrences deviate from the independence assumption. For a word-set F = {w1, . . . , wr}, we define

Surprise(F) = p(w1, . . . , wr) / (p(w1) · · · p(wr))
We argue that when considering a number of possible query expansions Fr(w1, . . . , wl), word-sets with high surprise values constitute ideal suggestions: we identify coherent clusters of documents within the original result set that are connected by a common underlying theme, as defined by the co-occurring words.
The use of surprise (unexpectedness) as a measure of interestingness has also been vindicated in the data
mining literature [137, 30, 53]. Additionally, the definition of surprise that we consider is simple yet intuitive and
has been successfully employed [30, 53].
Example 4.3. Consider a collection comprised of 250k documents and query “table, tennis”. Suppose that there exist 5k documents containing “table”, 2k documents containing “tennis” and 1k documents containing both words “table, tennis”. We easily compute that Surprise(table, tennis) = 25.
Let us compare the surprise value of two possible expansions: with term “car” (10k occurrences) and term
“paddle” (1k occurrences). Suppose (reasonably) that “car” is not particularly related to “table, tennis” and
therefore co-occurs independently with these words. Then, there exist 40 documents in the collection that contain all three words “table, tennis, car” (Figure 4.1). We compute that Surprise(table, tennis, car) = 25. While this expansion has a surprise value greater than 1, this is due to the correlation between “table” and “tennis”.
Now, consider the expansion with “paddle” and assume that 500 of the 1000 documents containing “table,
tennis” also contain “paddle” (“table, tennis, paddle”). We compute that Surprise(table, tennis, paddle) = 3125. As this example illustrates, enhancing queries with highly relevant terms results in expansions with considerably higher surprise values than enhancing them with irrelevant ones.
The maximum-likelihood estimates of the probabilities required to compute the surprise value of a word-set are derived from the textual data of the extended document collection D under consideration. We use c(F) = c(w1, . . . , wr) to denote the number of documents in a collection D that contain all r words of F. In the same spirit, we denote by c(wi) the number of documents that contain word wi and c(•) the total number of documents in the collection. Then, we can estimate p(w1, . . . , wr) = c(w1, . . . , wr)/c(•) and p(wi) = c(wi)/c(•). Using
1 If considered appropriate, more restrictive notions of co-occurrence can also be used, e.g., the words appearing within the same paragraph in a document.
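Under these estimates, surprise is computable from the raw document counts alone. The short Python sketch below reproduces the numbers of Example 4.3; the function and parameter names are illustrative.

```python
def surprise(c_all, word_counts, n_docs):
    # c_all: number of documents containing all words of the word-set F, c(F).
    # word_counts: individual document counts c(w_i) for each word in F.
    # n_docs: total number of documents in the collection, c(.).
    p_joint = c_all / n_docs
    p_indep = 1.0
    for c in word_counts:
        p_indep *= c / n_docs
    return p_joint / p_indep
```

With the counts of Example 4.3, surprise(1000, [5000, 2000], 250000) evaluates to 25 and the “paddle” expansion to 3125.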
This intuitive observation has also been validated by previous work on the dynamics of collaborative tagging.
[64, 70] demonstrated that the distribution of tags for a specific document converges rapidly to a remarkably stable,
heavy tailed distribution that is lightly affected by additional assignments. The heavy tailed distribution ascertains
the dominance of a handful of influential trends in describing a document’s content. The rapid convergence and
the stability of the distribution points to its predictability: namely, after witnessing a small number of assignments,
we should be able to predict with a high degree of confidence subsequent tag assignments.
Given the fast crystallization of users’ opinion about the content of a document, we can make a natural as-
sumption that will serve as a bridge between our ability to predict the future tagging activity for a document and
our need to compute p(q|d is relevant).
Users will use keyword sequences derived from the same distribution to both tag and search for a document.1
This logical link allows us to equate the probability p(q|d is relevant) to the probability of an assignment containing the same keywords as q being used to tag the document, i.e.,

p(q|d is relevant) = p(q is used to tag d)
The stability of the tag distribution allows us to accurately estimate the probability of a tag being used in the
future, based on the document’s tagging history. However, assignments are rarely comprised of a single tag. In our study (Section 5.6) we observed that the average length of an assignment is 2.77 tags. It is reasonable to
expect that neither the order in which tags are placed in an assignment, nor the co-occurrence patterns of tags in
assignments are random.
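Before modeling order and co-occurrence, the raw estimate itself can be sketched: treating each assignment as an ordered tag sequence, a simple maximum-likelihood estimate of p(q is used to tag d) is the fraction of past assignments that match the query keyword sequence. The Python sketch below is illustrative only; the tuple representation of assignments is an assumption for exposition.

```python
from collections import Counter

def tag_probability(history, q):
    # history: past assignments for document d, each a tuple of tags in the
    # order they were placed (hypothetical representation).
    # MLE estimate of p(q is used to tag d): fraction of assignments that
    # match the query keyword sequence exactly.
    if not history:
        return 0.0
    counts = Counter(history)
    return counts[tuple(q)] / len(history)
```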
In fact, [64] observed that tags are not used in random positions within an assignment, but rather progress
(from left to right) from more general to more specific and idiosyncratic. Therefore, assignments are not orderless
sets of tags, but sequences of tags, whose ordering tends to be consistent across the assignments attached to a
document, and consequently the queries used to search for it.
Additionally, tags representing different perspectives about a document’s content, although popular in their
own right, are less likely to co-occur in the same assignment.
Example 5.2. In our del.icio.us crawl, the Mozilla project main page is heavily annotated with tags “opensource”, “mozilla” and “firefox”. We observed that tags “opensource” and “firefox” appear together much less frequently than expected given their popularity, demonstrating two different perspectives for viewing the web site:
as the home of the Firefox browser or as an open source project. Such statistical deviations, more or less severe,
were observed throughout the del.icio.us collection.
1 Recent research [32] indicates that this assumption holds to a satisfactory degree. The two distributions are similar enough for our assumption to be plausible, but different enough for tags to provide incremental information not present in query logs.
tuples. Given the high dimensionality of the SDC grid, small changes in granularity can have huge impact on performance, as the number and size of cells is affected in an exponential manner. Nevertheless, for each individual experiment we used for SDC the granularity values that resulted in the best performance. Instead, for the STARS technique we kept the grid granularity at 10 buckets per dimension. Later in the section, we vary the grid granularity for STARS and observe performance trends.
[Figure 6.11: Effect of buffer size and number of attributes on performance. Panels (a)-(c): time per update (ms) vs. buffer size (10K-1000K) for 2d, 3d and 4d data; STARS vs. SDC.]
As is evident in Figures 6.11(a)-(c), STARS outperforms SDC by an order of magnitude. In Figure 6.11(c) we completely omitted the SDC technique, since its performance for documents with four attributes deteriorated.
CHAPTER 6. SKYLINE MAINTENANCE FOR DYNAMIC DOCUMENT COLLECTIONS 143
Furthermore, the time required by STARS in order to handle a buffer update is in the order of a millisecond or
less, thus rendering its use in real life applications entirely realistic. We will provide further evidence that supports this claim when we subsequently present our real data experiments.
Note that the performance trends in Figures 6.11(a)-(c) can be non-monotone with respect to the size of the buffer. This is to be expected, as there exist two competing trends that depend on the buffer size and affect performance. For example, as the buffer size increases, we can expect both the size of the skybuffer and the size of the skyline to increase. On the other hand, as the buffer size increases, the probability that the expiration of a document will affect the skyline decreases and so does the probability that an expensive skyline mending operation will have to be triggered. Remember that when a document belonging to the skyline expires, we need to identify all the documents that it exclusively dominated and insert them in the skyline. The converse is true when the buffer size decreases: the skyline size decreases while the invocations of reconstruction operations increase.
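The mending operation just described can be sketched in brute-force form. The Python sketch below is illustrative only: `leq` stands in for the per-attribute partial orders, and the STARS structures exist precisely to avoid this linear scan over the skybuffer.

```python
def dominates(a, b, leq):
    # a dominates b: a is preferred-or-equal on every attribute and strictly
    # preferred on at least one. leq(x, y) encodes the (partial) preference
    # order on a single attribute domain.
    at_least = all(leq(x, y) for x, y in zip(a, b))
    strictly = any(leq(x, y) and not leq(y, x) for x, y in zip(a, b))
    return at_least and strictly

def mend_skyline(skyline, skybuffer, expired, leq):
    # When a skyline document expires, promote every skybuffer document
    # that is no longer dominated by any remaining (live) document.
    skyline = [d for d in skyline if d != expired]
    live = skyline + [d for d in skybuffer if d != expired]
    for d in skybuffer:
        if d != expired and not any(dominates(o, d, leq) for o in live if o != d):
            skyline.append(d)
    return skyline
```

In the sketch, only documents exclusively dominated by the expired document (and not by any other live document) are promoted, matching the semantics described above.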
We performed additional experiments involving tree and wall domains with a wide range of parameter values (l, h, c). In these experimental results we observed similar trends and performance differences between the two techniques. The shape and size of the attribute domains influence performance mostly indirectly, by affecting the average size of the skyline: domains that produce larger skylines are associated with higher buffer update cost. Keeping two of the poset parameters (l, h, c) fixed, decreasing c (internal poset connectivity) increases the skyline size, and so does increasing l (number of poset values) and decreasing h (the number of depth levels).
[Figure 6.12: Performance on skewed real data. Time per update (ms) vs. buffer size (10K-200K) for STARS on the DMV data set.]
Besides synthetic data, we also employed a stream of real, skewed and correlated data for our performance
experiments. We used the categorical tuples of the DMV data set [80] as “documents”. They were associated with
three categorical attributes of the “cars” table: Maker/Model (38 possible values), Color (504 possible values)
and Year (74 possible values). The resulting documents have attribute values that are skewed and correlated. The degree of skew and correlation in this real data set is fixed. Also, since the categorical attributes are not associated with a partial-order, we manually organized their values in tree-structured posets (Figure 6.10(a)).
Figure 6.12 depicts the results of the experiment. For buffer sizes between 10K and 200K, the time required
by STARS to process a buffer update is in the order of a millisecond - an entirely realistic figure. SDC’s results are
omitted from the figure as it fared poorly: the update time ranged from 6ms for a 10K buffer to more than 100ms
for a large, 200K buffer. The figure demonstrates a clear upward trend in the update time for larger buffer sizes.
This can be attributed to the presence of skew and correlation in the attribute values that leads to considerably
larger skylines as the buffer size increases.
Our next experiment utilizes synthetic data in order to offer insight on the performance differential between
SDC and STARS. Figure 6.13 depicts the size of the skybuffer maintained by the techniques, for documents with two (Figure 6.13(a)) and three (Figure 6.13(b)) attributes. The domains were trees with parameters (500, 8, 0.3).
[Figure 6.13: Size of the skybuffer as a function of the buffer size. Panels (a), (b): skybuffer size (log scale, 0.1K-1000K) vs. buffer size for 2d and 3d data; STARS vs. SDC.]
SDC employs a mapping of documents with m categorical attributes to 2m-dimensional numerical tuples. However, the mapping is not exact in the sense that dominance in the categorical space does not imply dominance in the numerical space (Figure 6.9). The implication of this relation is that when a document is inserted in the skybuffer, a rectangular range search fails to identify all the skybuffer documents that are dominated by the inserted document. Therefore, the skybuffer of SDC can contain many more documents than the skybuffer of STARS, since documents that could have been removed still reside in the skybuffer. This has a big impact when SDC attempts to repair the skyline after a document expiration. Then, the additional documents in the skybuffer that need to be examined become a huge burden.
Notice that both axes are in logarithmic scale. In the case of documents with two attributes, the skybuffer size for SDC is greater than STARS’s, yet it is still manageable. However, for documents with three attributes, SDC’s
skybuffer size explodes: as the number of attributes increases, so does the number of categorical dominance relations that the mapping to the numerical domain fails to capture.
Further evaluation of STARS
We also designed and performed experiments to further study the STARS technique and its two components.
In particular, we performed experiments to quantify the pruning efficiency of the arrangement-based skyline
organization, as well as the gains that can be achieved by utilizing the proposed granularity setting mechanism in
conjunction with the poset partitioning technique.
[Figure 6.14: Pruning efficiency of arrangement skyline organization. Panels (a), (b): pruning efficiency vs. data dimensionality (2d, 3d, 4d) for tree and wall posets.]
Figures 6.14(a) and 6.14(b) present results demonstrating the pruning efficiency of the arrangement-based
skyline organization. A dominance query against the skyline determines whether a query document is dominated
by the skyline. The objective is to do so by checking as few skyline documents as possible. Therefore, pruning
efficiency is measured as the fraction of skyline documents that need to be examined on average in order to answer
a dominance query.
As is evident in Figures 6.14(a) and 6.14(b), the arrangement organization is able to answer a dominance query by considering on average about 10% of the skyline documents. This is true for both tree and wall structured domains. For this experiment, we materialized the tree structured domains with parameters (500, 8, 0.3) and the wall domains with parameters (250, 10, 0.3). Notice that the pruning efficiency decreases as the dimensionality increases, although not considerably. This is reasonable to expect, since the pruning technique utilizes
information from only two of the document’s attributes, therefore failing to exploit some pruning opportunities as
the dimensionality increases.
Our next experiment demonstrates the potential performance benefits of utilizing the techniques of Section 6.4.3, which allow us to set the skybuffer grid granularity. This involves partitioning the attribute domains into disjoint
value groups so that the desired grid granularity is matchedand the expected query time is minimized. For this
experiment, we measured the average time in milliseconds required to perform a skybuffer update, i.e., identify
all the documents in the skybuffer dominated by an incoming document, remove them and insert the incoming
document in the skybuffer.
[Figure 6.15: Effect of grid granularity on performance. Panels (a), (b): time per update (ms) vs. grid granularity (buckets per dimension) for tree and wall posets.]
Figure 6.15 depicts the time required to perform a skybuffer update versus the grid granularity, for documents with three attributes. The grid granularity is measured as the number of allocated buckets per dimension. More specifically, Figure 6.15(a) presents the results when the domains are tree structured posets with parameters (200, 4, 0.1), while Figure 6.15(b) presents the results when the domains are wall structured posets with parameters (200, 4, 0.1).
The leftmost values in the plots correspond to a partitioning that allocates a single bucket per depth level. By increasing the granularity we can achieve better performance. This performance increase would not be possible without our poset partitioning heuristic: a bad partitioning strategy would result in performance inferior to the bucket-per-depth-level partitioning scheme. However, the poset partitioning technique allows us to translate an increase in the grid granularity into an increase in performance. The benefit of increasing the grid granularity is
eventually offset by the overhead of visiting many sparse cells. This additional overhead explains the knee in the
curves of Figure 6.15. Notice that even though performance can be poor for extreme granularity values, there is a
wide range of values that offer near optimal performance.
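The bucket-per-depth-level baseline that the plots start from can be sketched as follows. This is our own illustration: the poset encoding and the greedy split used to reach a target granularity are simplifications, and the actual heuristic of Section 6.4.3 also takes the expected query time into account when forming value groups.

```python
from collections import defaultdict

def depth_partition(children, roots, target_buckets=None):
    """Group poset nodes into buckets by depth level (bucket-per-depth-level).

    children: dict mapping each node to a list of its children.
    roots:    nodes with no parent (depth 0).
    target_buckets: if given, greedily split the largest depth groups
    until that many buckets exist (a stand-in for the finer-grained
    partitioning heuristic described in Section 6.4.3).
    """
    depth = {}
    frontier, d = list(roots), 0
    while frontier:
        nxt = []
        for n in frontier:
            depth[n] = d                      # deepest level wins on re-visit
            nxt.extend(children.get(n, []))
        frontier, d = nxt, d + 1

    groups = defaultdict(list)
    for node, lvl in depth.items():
        groups[lvl].append(node)
    buckets = [sorted(g) for _, g in sorted(groups.items())]

    while target_buckets and len(buckets) < target_buckets:
        big = max(range(len(buckets)), key=lambda i: len(buckets[i]))
        g = buckets.pop(big)
        if len(g) < 2:
            break
        mid = len(g) // 2
        buckets[big:big] = [g[:mid], g[mid:]]
    return buckets
```

For example, a tree poset a → {b, c}, b → {d} yields three depth buckets, and requesting four buckets splits the largest level in two.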
Summary
To summarize, our experimental evaluation demonstrated the applicability of the proposed solution to a wide
range of buffer sizes and dimensionality (number of attributes), for both synthetic (Figures 6.11(a)-(c)) and real
(Figure 6.12) data. We also verified our claim that the skybuffer indexing technique can adapt to posets of any
shape and size by offering flexibility in controlling the granularity of the grid-based indexing structure (Figure
6.15). The second claim that we verified was the pruning efficiency of the skyline organization, which was also
found to be resilient to increases in dimensionality (Figure 6.14). Lastly, we demonstrated the inapplicability of
the existing offline skyline evaluation techniques of [35] in a streaming environment (Figures 6.11(a)-(c)) and
identified the inherent reasons behind this poor performance (Figure 6.13).
6.6 Conclusions
In this Chapter, we identified and motivated the problem of maintaining the skyline of a dynamic collection of
extended documents associated with partially ordered categorical meta-data attributes, and realized two novel
techniques that constitute the building blocks of an efficient solution to the problem.
We introduced a lightweight data structure for indexing the documents in the streaming buffer, that can gracefully adapt to documents with many attributes and partially ordered domains of any size and complexity. We subsequently studied the dominance relation in the dual space and utilized geometric arrangements in order to index the categorical skyline and efficiently evaluate dominance queries. Lastly, we performed a thorough experimental study to evaluate the efficiency of the proposed techniques.
Chapter 7

Integrating Structured Data into Web Search
7.1 Introduction
In Section 2.7 we reviewed how search engines are evolving from textual information retrieval systems to highly
sophisticated answering ecosystems utilizing information from multiple diverse sources. One such valuable
source of information is structured data, abstracted as relational tables, and readily available in publicly accessible
data repositories or proprietary databases.
Structured data can be used to better serve a large body of user queries that target information which does not
reside on a single or any Web page. Queries about products (e.g., “50 inch LG lcd tvs”, “orange fendi handbag”,
“white tiger book”), movie showtime listings (e.g., “indiana jones 4 near boston”), airline schedules (e.g., “flights
from boston to new york”), are only a few examples of queries that are better served by directly using information
from structured data. This information is presented prominently along with regular search results (Figure 7.1).
However, integrating structured data collections with the traditional Web page index in this manner poses the following important challenges:
Web speed: Web users have become accustomed to lightning fast responses. Studies have shown that even sub-
second delays in returning search results cause dissatisfaction among Web users, resulting in query abandonment
and loss of revenue for search engines.
Web scale: Users issue over 100 million Web queries per day. Additionally, there is an abundance of structured data [23] already available within a search engine's ecosystem from sources like crawling, data feeds, business deals or proprietary information. The combination of the two makes an efficient end-to-end solution non-trivial.
Figure 7.1: Integrating structured data into Web search
Free-text queries: Web users express queries in unstructured free-form text without knowledge of schema or available databases. To produce meaningful results, query keywords should be mapped to structure.
For example, consider the query “50 inch LG lcd tv” and assume that there exists a table with information on TVs. One way to handle such a query would be to treat each product as a bag of words and apply standard Information Retrieval techniques. However, assume that LG does not make 50 inch lcd tvs – there is a 46 inch and a 55 inch lcd tv model. Simple keyword search would retrieve nothing. On the other hand, consider a structured query that targets table “TVs” and specifies attributes Diagonal = “50 inch”, Brand = “LG”, TV Type = “lcd tv”. Now, the retrieval and ranking system can handle this query with a range predicate on Diagonal and a fast selection on the other attributes.
Intent disambiguation: Web users seek information and issue queries oblivious to the existence of structured data sources, let alone their schema and their arrangement. A mechanism that directly maps keywords to structure can lead to misinterpretations of the user's intent for a large class of queries. There are two possible types of misinterpretations: between Web versus structured data, and between individual structured tables.
For example, consider the query “white tiger” and assume there is a table available containing Shoes and one containing Books. For “white tiger”, a potential mapping can be Table = “Shoes” and attributes Color = “white” and Shoe Line = “tiger”, after the popular Asics Tiger line. A different potential mapping can be Table = “Books” and Title = “white tiger”, after the popular book. Although both mappings are possible, it seems that the book is more applicable in this scenario.
On the other hand, it is also quite possible the user was asking for information that is not contained in our collection of available structured data, for example about “white tiger”, the animal. In such a case, presenting query results
with structured information about either books or shoes would be detrimental to user experience. Hence, although
multiple structured mappings can be feasible, it is important to determine which ones are at all meaningful. Such
information can greatly benefit overall result quality.
To address these challenges, we exploit latent structured semantics in Web queries to create mappings to structured data tables and attributes. We call such mappings Structured Annotations. For example, an annotation for the query “50 inch LG lcd tv” specifies the Table = “TVs” and the attributes Diagonal = “50 inch”, Brand = “LG”, TV Type = “lcd tv”.
However, as we have already demonstrated with the query “white tiger”, generating all possible annotations is not sufficient. We need to estimate the plausibility of each annotation and determine the one that most likely captures the intent of the user. To handle such problems we designed a principled probabilistic model that scores each possible structured annotation. In addition, it also computes a score for the possibility of the query targeting information outside the structured data collection. The latter score acts as a dynamic threshold mechanism used to expose annotations that correspond to misinterpretations of the user intent.
The result is a Query Annotator component, shown in Figure 7.2. It is worth clarifying that we are not solving the end-to-end problem of including structured data in response to Web queries. That would include other components such as indexing, data retrieval, ranking and presentation. Our Query Annotator component sits on the front end of such an end-to-end system. Its output can be utilized to route queries to appropriate tables and feed annotation scores to a structured data ranker.
[Figure 7.2 diagram: the query “50" LG lcd” is processed online by the Tagger, which produces candidate annotations (A1, A2, ...); the Scorer then uses statistics learned offline from the query log and the data tables to output scored, plausible annotations (e.g., A1: 0.92).]
Figure 7.2: Overview of Query Annotator
Our contributions with respect to the challenges of integrating structured data into Web search are as follows.
1. Web speed: We design an efficient tokenizer and tagger mechanism producing annotations in milliseconds.
2. Web scale: We map the problem to a decomposable closed-world summary of the structured data that can be computed in parallel for each structured table.
3. Free-text queries: We define the novel notion of a Structured Annotation capturing structure from free
text. We show how to implement a process producing all annotations given a closed structured data world.
4. Intent disambiguation: We describe a scoring mechanism that sorts annotations based on plausibility.
Furthermore, we extend the scoring with a dynamic threshold, derived from the probability a query was not
described by our closed world.
Part of the work presented in this Chapter also appears in [134] and is organized as follows: We describe the closed structured world and Structured Annotations in Section 7.2. We discuss the efficient tokenizer and tagger process that deterministically produces all annotations in Section 7.3. We define a principled probabilistic generative model used for scoring the annotations in Section 7.4 and we discuss unsupervised model parameter learning in Section 7.5. We performed a thorough experimental evaluation with promising results, presented in Section 7.6. We conclude the chapter with a discussion of existing work in Section 7.7 and some closing comments in Section 7.8.
7.2 Structured Annotations
We assume that our application maintains a Web page collection D and a collection of structured data tables T = {T1, T2, . . . , Tτ}.¹ A table T is a set of related entities sharing a set of attributes. We denote the attributes of table T as T.A = {T.A1, T.A2, . . . , T.Aα}. Attributes can be either categorical or numerical. The domain of a categorical attribute T.Ac, i.e., the set of possible values that T.Ac can take, is denoted with T.Ac.V. We assume that each numerical attribute T.An is associated with a single unit U of measurement. Given a set of units U, we define Num(U) to be the set of all tokens that consist of a numerical value followed by a unit in U. Hence, the domain of a numerical attribute T.An is Num(T.An.U), and the domain of all numerical attributes in a table is the union of Num(T.An.U) over its numerical attributes T.An.
An example of two tables is shown in Figure 7.3. The first table contains TVs and the second Monitors. They both have three attributes: Type, Brand and Diagonal. Type and Brand are categorical, whereas Diagonal is numerical. The domain of values for all categorical attributes for both tables is T.Ac.V = {TV, Samsung, Sony, LG, Monitor, Dell, HP}. The domain for the numerical attributes for both tables is Num(T.An.U) = Num(inch). Note that Num(inch) does not include only the values that appear in the tables of the example, but rather all possible numbers followed by the unit “inch”. Additionally, note that it is possible to extend the domains with synonyms, e.g., by using “in” for “inches” and “Hewlett Packard” for “HP”. Discovery of synonyms is beyond the scope of this chapter, but existing techniques [112] can be leveraged.
We now give the following definitions.
¹The organization of data into tables is purely conceptual and orthogonal to the underlying storage layer: the data can be physically stored in XML files, relational tables, retrieved from remote Web services, etc. Our assumption is that a mapping between the storage layer and the “schema” of table collection T has been defined.
Definition 7.1 (Token). A token is defined as a sequence of characters including space, i.e., one or more words.
For example, the bigram “digital camera” may be a single token.
Definition 7.2 (Open Language Model). We define the Open Language Model (OLM) as the infinite set of all possible tokens. All keyword Web queries can be expressed using tokens from OLM.
Definition 7.3 (Typed Token). A typed token t for table T is any value from the domain T.Ac.V ∪ Num(T.An.U).
Definition 7.4 (Closed Language Model). The Closed Language Model CLM of table T is the set of all duplicate-free typed tokens for table T.
For the rest of the chapter, for simplicity, we often refer to typed tokens as just tokens. The closed language model CLM(T) contains the duplicate-free set of all tokens associated with table T. Since for numerical attributes we only store the “units” associated with Num(U), the representation of CLM(T) is very compact.
The closed language model CLM(T) for all our structured data T is defined as the union of the closed language models of all tables. Furthermore, by definition, if we break a collection of tables T into k sub-collections T1, ..., Tk, then CLM(T) can be decomposed into CLM(T1), ..., CLM(Tk). In practice, CLM(T) is used to identify tokens in a query that appear in the tables of our collection. So compactness and decomposability are very important features that address the Web speed and Web scale challenges.
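Under the definitions above, a closed language model and its decomposability can be sketched as follows. The table encoding (dicts of categorical value lists plus a list of units) and the function names are our own illustration, not the thesis implementation.

```python
def clm(table):
    """Closed language model of one table: the duplicate-free set of
    typed tokens, with numerical attributes kept compactly as units."""
    tokens = set()
    for values in table.get("categorical", {}).values():
        tokens.update(v.lower() for v in values)
    # numerical domains are represented only by their units, i.e. Num(U)
    units = set(table.get("units", []))
    return tokens, units

def clm_union(tables):
    """CLM of a collection decomposes into the union of per-table CLMs,
    so it can be built independently (and in parallel) per table."""
    tokens, units = set(), set()
    for t in tables:
        tk, un = clm(t)
        tokens |= tk
        units |= un
    return tokens, units
```

Because each per-table CLM is a plain set, sub-collections can be summarized separately and merged with set union, which is what makes the representation both compact and decomposable.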
The closed language model defines the set of tokens that are associated with a collection of tables, but it does not assign any semantics to these tokens. To this end, we define the notions of an annotated token and closed structured model.
Definition 7.5 (Annotated Token). An annotated token for a table T is a pair AT = (t, T.A) of a token t ∈ CLM(T) and an attribute T.A of table T, such that t ∈ T.A.V.
For an annotated token AT = (t, T.A), we use AT.t to refer to the underlying token t. Similarly, we use AT.T and AT.A to refer to the underlying table T and attribute A. Intuitively, the annotated token AT assigns structured semantics to a token. In the example of Figure 7.3, the annotated token (LG, TVs.Brand) denotes that the token “LG” is a possible value for the attribute TVs.Brand.
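A minimal tagger producing annotated tokens over such a representation might look like the sketch below. The table encoding is hypothetical, and the naive substring matching is only illustrative; the actual tokenizer and tagger of Section 7.3 handle token boundaries and maximality carefully.

```python
import re

def annotated_tokens(query, tables):
    """Enumerate annotated tokens (t, table, attribute) for a query.

    tables: name -> {"categorical": {attr: [values]},
                     "numerical":   {attr: unit}}
    Numerical typed tokens are matched as <number> <unit>, i.e. Num(U).
    Matching is naive substring containment, for illustration only.
    """
    out = []
    q = query.lower()
    for name, t in tables.items():
        for attr, values in t.get("categorical", {}).items():
            for v in values:
                if v.lower() in q:
                    out.append((v.lower(), name, attr))
        for attr, unit in t.get("numerical", {}).items():
            for m in re.finditer(r"\d+(?:\.\d+)?\s*" + re.escape(unit), q):
                out.append((m.group(), name, attr))
    return out
```

On the running example “50 inch LG lcd tv”, this produces the annotated tokens (50 inch, TVs.Diagonal), (lg, TVs.Brand) and (lcd tv, TVs.Type).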
The ratio λ/µ controls the confidence we place in the unigram model, versus the possibility that the free tokens come from the background distribution. Given the importance and potentially deleterious effect of free tokens on the probability and plausibility of an annotation, we would like to exert additional control on how free tokens affect the overall probability of an annotation. In order to do so, we introduce a tuning parameter 0 < φ ≤ 1, which can be used to additionally “penalize” the presence of free tokens in an annotation. To this end, we compute:

P(w|T) = φ (λ P(w|UMT) + µ P(w|OLM))
Intuitively, we can view φ as the effect of a process that outputs free tokens with probability zero (or asymptotically close to zero), which is activated with probability 1 − φ. We set the ratio λ/µ and the penalty parameter φ in our experimental evaluation in Section 7.6.
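As a worked illustration of the formula above: the normalization λ + µ = 1 is our own assumption (a standard mixture convention; the thesis only fixes the ratio λ/µ = 10 in Section 7.6), and the function name is hypothetical.

```python
def free_token_prob(p_um, p_olm, lam=10/11, mu=1/11, phi=0.1):
    """P(w|T) = phi * (lambda * P(w|UM_T) + mu * P(w|OLM)).

    p_um:  unigram probability of w under the table language model UM_T
    p_olm: background (open language model) probability of w
    lam/mu encode the ratio lambda/mu = 10 used in the experiments,
    assuming lambda + mu = 1; phi penalizes free tokens
    (phi = 0.1 for SAQ-MED, 0.01 for SAQ-LOW).
    """
    assert abs(lam + mu - 1.0) < 1e-9, "mixture weights assumed to sum to 1"
    return phi * (lam * p_um + mu * p_olm)
```

Lowering φ from 0.1 to 0.01 scales every free-token probability down tenfold, which is exactly how SAQ-LOW becomes less tolerant of free tokens than SAQ-MED.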
7.5.2 Estimating Template Probabilities
We now focus on estimating the probability of a query targeting particular tables and attributes, i.e., estimating P(T.Ai) for an annotation Si. A parallel challenge is the estimation of P(OLM), i.e., the probability of a query being generated by the open language model, since this is considered as an additional type of “table” with a single attribute that generates free tokens. We will refer to table and attribute combinations as attribute templates.
The most reasonable source of information for estimating these probabilities is Web query log data, i.e., user-issued Web queries that have already been witnessed. Let Q be such a collection of witnessed Web queries. Based on our assumptions, these queries are the output of |Q| “runs” of the generative process depicted in Figure 7.5(b). The unknown parameters of a probabilistic generative process are typically computed using maximum likelihood estimation, that is, estimating attribute template probability values P(T.Ai) and P(OLM) that maximize the likelihood of the generative process giving birth to query collection Q.
Consider a keyword query q ∈ Q and its annotations Sq. The query can either be the formulation of a request for structured data captured by an annotation Si ∈ Sq, or a free-text query described by the SOLM annotation. Since these possibilities are disjoint, the probability of the generative process outputting query q is:

P(q) = Σ_{Si∈Sq} P(Si) + P(SOLM) = Σ_{Si∈Sq} P(ATi, FTi | T.Ai) × P(T.Ai) + P(FTq | OLM) P(OLM)
A more general way of expressing P(q) is by assuming that all tables in the database and all possible combinations of attributes from these tables could give birth to query q and, hence, contribute to probability P(q). The combinations that do not appear in annotation set Sq will have zero contribution. Formally, let Ti be a table, and let Pi denote the set of all possible combinations of attributes of Ti, including the free token emitting attribute Ti.f. Then, for a table collection T of size |T|, we can write:
P(q) = Σ_{i=1}^{|T|} Σ_{Aj∈Pi} αqij πij + βq πo
where αqij = P(ATij, FTij | Ti.Aj), βq = P(FTq | OLM), πij = P(Ti.Aj) and πo = P(OLM). Note that for annotations Sij ∉ Sq, we have αqij = 0. For a given query q, the parameters αqij and βq can be computed as described in Section 7.5.1. The parameters πij and πo correspond to the unknown attribute template probabilities we need to estimate.
Therefore, the log-likelihood of the entire query log can be expressed as follows:

L(Q) = Σ_{q∈Q} log P(q) = Σ_{q∈Q} log ( Σ_{i=1}^{|T|} Σ_{Aj∈Pi} αqij πij + βq πo )
Maximization of L(Q) results in the following problem:

max_{πij, πo} L(Q), subject to Σ_{ij} πij + πo = 1    (7.6)
The condition Σ_{ij} πij + πo = 1 follows from the fact that, based on our generative model, all queries can be explained either by an annotation over the structured data tables, or as free-text queries generated by the open-world language model.
This is a large optimization problem with millions of variables. Fortunately, the objective function L(πij, πo|Q) is concave. This follows from the fact that the logarithm of a linear function is concave, and a sum of concave functions remains concave. Therefore, any optimization algorithm will converge to a global maximum.
A simple, efficient optimization algorithm is the Expectation-Maximization (EM) algorithm [24].
Lemma 7.1. The constrained optimization problem described by equation (7.6) can be solved using the Expectation-Maximization algorithm. For every keyword query q and variable πij, we introduce auxiliary variables γqij and δq. The algorithm's iterations are provided by the following formulas:

• E-Step: γ^{t+1}_{qij} = αqij π^t_{ij} / (Σ_{km} αqkm π^t_{km} + βq π^t_o)

  δ^{t+1}_q = βq π^t_o / (Σ_{km} αqkm π^t_{km} + βq π^t_o)

• M-Step: π^{t+1}_{ij} = Σ_q γ^{t+1}_{qij} / |Q|

  π^{t+1}_o = Σ_q δ^{t+1}_q / |Q|
Proof. For a related proof, see [24].
The EM algorithm's iterations are extremely lightweight and progressively improve the estimates for the variables πij, πo.
More intuitively, the algorithm works as follows. The E-step uses the current estimates of πij, πo to compute for each query q the probabilities P(Sij), Sij ∈ Sq and P(SOLM). Note that for a given query we only consider annotations in the set Sq. The appearance of each query q is “attributed” among the annotations Sij ∈ Sq and SOLM proportionally to their probabilities, i.e., γqij stands for the “fraction” of query q resulting from annotation Sij involving table Ti and attributes Ti.Aj. The M-step then estimates πij = P(Ti.Aj) as the sum of query “fractions” associated with table Ti and attribute set Ti.Aj, over the total number of queries in Q.
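The E-step and M-step formulas above translate almost directly into code. The sketch below is our own illustration: the input encoding (one dict of non-zero αqij per query, one βq per query), the function name and the dense-dict representation are assumptions, not the thesis implementation.

```python
def em_attribute_templates(alpha, beta, n_iters=50):
    """EM iterations for the attribute-template probabilities pi_ij, pi_o.

    alpha: one dict per query q, mapping template ids (i, j) to
           alpha_qij = P(AT_ij, FT_ij | Ti.Aj) for annotations in Sq
           (templates absent from the dict contribute zero).
    beta:  one value per query, beta_q = P(FT_q | OLM).
    Returns (pi, pi_o) with sum(pi.values()) + pi_o == 1.
    """
    templates = {ij for a in alpha for ij in a}
    n = len(templates) + 1
    pi = {ij: 1.0 / n for ij in templates}      # uniform initialization
    pi_o = 1.0 / n
    for _ in range(n_iters):
        gamma_sum = {ij: 0.0 for ij in templates}
        delta_sum = 0.0
        for a, b in zip(alpha, beta):
            # E-step: attribute query q among its annotations and the OLM
            z = sum(a_ij * pi[ij] for ij, a_ij in a.items()) + b * pi_o
            for ij, a_ij in a.items():
                gamma_sum[ij] += a_ij * pi[ij] / z
            delta_sum += b * pi_o / z
        # M-step: average the responsibilities over all |Q| queries
        pi = {ij: g / len(alpha) for ij, g in gamma_sum.items()}
        pi_o = delta_sum / len(alpha)
    return pi, pi_o
```

Each iteration is a single pass over the query log, touching only the non-zero αqij entries, which matches the "extremely lightweight" characterization above.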
7.6 Experimental Evaluation
We implemented our proposed Query Annotator solution in C#. We performed a large-scale experimental evaluation utilizing real data to validate our ability to successfully address the challenges discussed in Section 7.1.
The structured data collection T used was comprised of 1176 structured tables available to us from the Bing search engine. In total, there were around 30 million structured data tuples occupying approximately 400GB on disk when stored in a database. The same structured data are publicly available via an XML API.3
The tables used represent a wide spectrum of entities, such as Shoes, Video Games, Home Appliances, Televisions, and Digital Cameras. We also used tables with “secondary” complementary entities, such as Camera Lenses or Camera Accessories, that have high vocabulary overlap with “primary” entities in table Digital Cameras. This way we stress-test result quality on annotations that are semantically different but have very high token overlap.
Besides the structured data collection, we also used logs of Web queries posed on the Bing search engine. For our detailed quality experiments we used a log comprised of 38M distinct queries, aggregated over a period of 5 months.
7.6.1 Algorithms
The annotation generation component presented in Section 7.3 is guaranteed to produce all maximal annotations.
Therefore, we only test its performance as part of our scalability tests presented in Section 7.6.5. We compare the
annotation scoring mechanism against a greedy alternative. Both algorithms score the same set of annotations,
output by the annotation generation component (Section 7.3).
3See http://shopping.msn.com/xml/v1/getresults.aspx?text=televisions for a table of TVs and http://shopping.msn.com/xml/v1/getspecs.aspx?-itemid=1202956773 for an example of TV attributes.
Annotator SAQ: The SAQ annotator (Structured Annotator of Queries) stands for the full solution introduced in this work. Two sets of parameters affecting SAQ's behavior were identified. The first is the threshold parameter θ used to determine the set of plausible structured annotations, satisfying P(Si)/P(SOLM) > θ (Section 7.4). Higher threshold values render the scorer more conservative in outputting annotations, hence usually resulting in higher precision. The second is the language model parameters: the ratio λ/µ that balances our confidence in the unigram table language model versus the background open language model, and the penalty parameter φ. We fix λ/µ = 10, which we found to be a ratio that works well in practice and captures our intuition for the confidence we have in the table language model. We consider two variations of SAQ based on the value of φ: SAQ-MED (medium tolerance to free tokens) using φ = 0.1, and SAQ-LOW (low tolerance to free tokens) using φ = 0.01.
Annotator IG-X: The Intelligent Greedy (IG-X) annotator scores annotations Si based on the number of annotated tokens |ATi| that they contain, i.e., Score(Si) = |ATi|. The Intelligent Greedy annotator captures the intuition that higher scores should be assigned to annotations that interpret structurally a larger part of the query. Besides scoring, the annotator needs to deploy a threshold, i.e., a criterion for eliminating meaningless annotations and identifying the plausible ones. The plausible annotations determined by the Intelligent Greedy annotator are those satisfying (i) |FTi| ≤ X, (ii) |ATi| ≥ 2 and (iii) P(ATi|T.Ai) > 0. Condition (i) puts an upper bound X on the number of free tokens a plausible annotation should contain: an annotation with more than X free tokens cannot be plausible. Note that the annotator completely ignores the affinity of the free tokens to the annotated tokens and only reasons based on their number. Condition (ii) demands a minimum of two annotated tokens, in order to eliminate spurious annotations. Finally, condition (iii) requires that the attribute-value combination identified by an annotation has a non-zero probability of occurring. This eliminates combinations of attribute values that have zero probability according to the multi-attribute statistics we maintain (Section 7.5.1).
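The IG-X score and its three plausibility conditions can be sketched directly. The annotation record shape used here ('at' for annotated tokens, 'ft' for free tokens, 'p_at' for P(ATi|T.Ai)) is hypothetical, chosen only to make the conditions concrete.

```python
def ig_x_plausible(annotations, X):
    """IG-X baseline: score an annotation by its number of annotated
    tokens, and keep only annotations satisfying
    (i)   at most X free tokens,
    (ii)  at least 2 annotated tokens, and
    (iii) non-zero probability of the attribute-value combination.
    Returns the plausible annotations, highest score first.
    """
    plausible = [a for a in annotations
                 if len(a["ft"]) <= X        # condition (i)
                 and len(a["at"]) >= 2       # condition (ii)
                 and a["p_at"] > 0]          # condition (iii)
    return sorted(plausible, key=lambda a: len(a["at"]), reverse=True)
```

Note that the filter never looks at which free tokens appear, only at how many, which is exactly the weakness the experiments below expose for X > 0.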
7.6.2 Scoring Quality
We quantify annotation scoring quality using precision and recall. This requires obtaining labels for a set of queries and their corresponding annotations. Since manual labeling could not be realistically done on the entire structured data and query collections, we focused on 7 tables: Digital Cameras, Camcorders, Hard Drives, Digital Camera Lenses, Digital Camera Accessories, Monitors and TVs. The particular tables were selected because of their high popularity, and also the challenge that they pose to the annotators due to the high overlap of their corresponding closed language models (CLM). For example, tables TVs and Monitors or Digital Cameras and Digital Camera Lenses have very similar attributes and values.
The ground truth query set, denoted Q, consists of 50K queries explicitly targeting the 7 tables. The queries were identified using relevant click log information over the structured data and the query-table pair validity was
manually verified. We then used our tagging process to produce all possible maximal annotations and labeled
manually the correct ones, if any.
We now discuss the metrics used for measuring the effectiveness of our algorithms. An annotator can output multiple plausible structured annotations per keyword query. We define 0 ≤ TP(q) ≤ 1 as the fraction of correct plausible structured annotations over the total number of plausible structured annotations identified by an annotator. We also define a keyword query as covered by an annotator, if the annotator outputs at least one plausible annotation. Let also Cov(Q) denote the set of queries covered by an annotator. Then, we define:
Precision = Σ_{q∈Q} TP(q) / |Cov(Q)|,   Recall = Σ_{q∈Q} TP(q) / |Q|
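These two metrics can be computed directly from the per-query TP(q) fractions; a small sketch follows (the data shapes and function name are our own, and TP(q) is 0 for uncovered queries, so summing over covered queries is equivalent to summing over Q).

```python
def precision_recall(tp, covered, total_queries):
    """Precision and recall from per-query TP(q) fractions.

    tp:      dict query -> fraction of correct plausible annotations
             among the plausible annotations output for that query.
    covered: set of queries for which the annotator output at least
             one plausible annotation (TP is 0 for the rest).
    """
    s = sum(tp.get(q, 0.0) for q in covered)
    precision = s / len(covered) if covered else 0.0
    recall = s / total_queries if total_queries else 0.0
    return precision, recall
```

Precision divides by the covered queries only, while recall divides by all |Q| queries, so an annotator that answers rarely but correctly scores high precision and low recall.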
Figure 7.6 presents the Precision vs Recall plot for SAQ-MED, SAQ-LOW and the IG-X algorithms. Threshold θ values for SAQ were in the range 0.001 ≤ θ ≤ 1000. Each point in the plot corresponds to a different θ value. The SAQ-based annotators and IG-0 achieve very high precision, with SAQ being a little better. To some extent this is to be expected, given that these are “cleaner” queries, with every single query pre-classified to target the structured data collection. Therefore, an annotator is less likely to misinterpret open-world queries as a request for structured data. Notice, however, that the recall of the SAQ-based annotators is significantly higher than that of IG-0. The IG-X annotators achieve similar recall for X > 0, but the precision degrades significantly. Note also that increasing the allowable free tokens from 1 to 5 does not give gains in recall, but causes a large drop in precision. This is expected since targeted queries are unlikely to contain many free tokens.
Figure 7.6: Precision and Recall using Targeted Queries
Since the query data set is focused only on the tables we consider, we decided to stress-test our approach even further: we set threshold θ = 0, effectively removing the adaptable threshold separating plausible and implausible annotations, and considered only the most probable annotation. SAQ-MED precision was measured at 78% and
recall at 69% for θ = 0, versus precision 95% and recall 40% for θ = 1. This highlights the following points.
First, even queries targeting the structured data collection can have errors and the adaptive threshold based on
the open-language model can help precision dramatically. Note that errors in this case happen by misinterpreting
queries amongst tables or the attributes within a table, as there are no generic Web queries in this labeled data
set. Second, there is room for improving recall significantly. A query is often not annotated due to issues with
stemming, spell-checking or missing synonyms. For example, we do not annotate token “cannon” when it is used
instead of “canon”, or “hp” when used instead of “hewlett-packard”. An extended structured data collection using
techniques as in [38, 41] can result in significantly improved recall. Finally, we measured that in approximately
19% of the labeled queries, not a single token relevant to the considered table attributes was used in the query. This
means there was no possible mapping from the open language used in Web queries to the closed world described
by the available structured data.
7.6.3 Handling General Web Queries
Having established that the proposed solution performs well in a controlled environment where queries are known
to target the structured data collection, we now investigate its quality on general Web queries. We use the full log of 38M queries, representative of an everyday Web search engine workload. These queries vary a lot in context and are easy to misinterpret, essentially stress-testing the annotator's ability to suppress false positives.
We consider the same annotator variants: SAQ-MED, SAQ-LOW and IG-X. For each query, the algorithms
output a set of plausible annotations. For each alternative, a uniform random sample of covered queries was
retrieved and the annotations were manually labeled by 3 judges. A different sample for each alternative was
used; 450 queries for each of the SAQ variations and 150 queries for each of the IG variations. In total, 1350
queries were thoroughly hand-labeled. Again, to minimize the labeling effort, we only consider structured data
from the same 7 tables mentioned earlier.
The plausible structured annotations associated with each query were labeled as Correct or Incorrect based on whether an annotation was judged to represent a highly likely interpretation of the query over our collection of tables T. We measure precision as:
Precision = (# of correct plausible annotations in the sample) / (# of plausible annotations in the sample)
It is not meaningful to compute recall on the entire query set of 38 million. The vast majority of the Web queries are general purpose queries and do not target the structured data collection. To compensate, we measured coverage, defined as the number of covered queries, as a proxy of relative recall.
Figure 7.7 presents the annotation precision-coverage plot, for different threshold values. SAQ uses threshold values ranging in 1 ≤ θ ≤ 1000. Many interesting trends emerge from Figure 7.7. With respect to SAQ-MED
Figure 7.7: Precision and Coverage using General Web Queries
and SAQ-LOW, the annotation precision achieved is extremely high, ranging from 0.73 to 0.89 for SAQ-MED and 0.86 to 0.97 for SAQ-LOW. Expectedly, SAQ-LOW's precision is higher than SAQ-MED's, as SAQ-MED is more tolerant towards the presence of free tokens in a structured annotation. As discussed, free tokens have the potential to completely distort the interpretation of the remainder of the query. Hence, by being more tolerant, SAQ-MED misinterprets queries that contain free tokens more frequently than SAQ-LOW. Additionally, the effect of the threshold on precision is pronounced for both variations: a higher threshold value results in higher precision.
The annotation precision of IG-1 and IG-5 is extremely low, demonstrating the challenge that free tokens introduce and the value of treating them appropriately. Even a single free token (IG-1) can have a deleterious effect on precision. However, even IG-0, which only outputs annotations with zero free tokens, offers lower precision than the SAQ variations. The IG-0 algorithm, by not reasoning in a probabilistic manner, makes a variety of mistakes, the most important of which is to erroneously identify latent structured semantics in open-world queries. The “white tiger” example mentioned in Section 7.1 falls in this category. To verify this claim, we collected and labeled a sample of 150 additional structured annotations that were output by IG-0, but rejected by SAQ-MED with θ = 1. SAQ's decision was correct approximately 90% of the time.
With respect to coverage, as expected, the more conservative variations of SAQ, which demonstrated higher
precision, have lower coverage values. SAQ-MED offers higher coverage than SAQ-LOW, while increased thresh-
old values result in reduced coverage. Note also the very poor coverage of IG-0. SAQ, by allowing and properly
handling free tokens, substantially increases coverage without sacrificing precision.
7.6.4 Understanding Annotation Pitfalls
We performed micro-benchmarks using the hand-labeled data described in Section 7.6.3 to better understand when
the annotator works well and when it does not. We looked at the effect of annotation length, free tokens and
structured data overlap.
Number of free tokens
Figures 7.8(a) and 7.9(a) depict the fraction of correct and incorrect plausible structured annotations with respect
to the number of free tokens, for configurations SAQ-LOW (with θ = 1) and IG-5 respectively. For instance, the
second bar of Figure 7.8(a) shows that 35% of all plausible annotations contain 1 free token: 24% were correct, and 11%
were incorrect. Figures 7.8(b) and 7.9(b) normalize these fractions for each number of free tokens. For instance,
the second bar of Figure 7.8(b) signifies that of the structured annotations with 1 free token output by SAQ-LOW,
approximately 69% were correct and 31% were incorrect.
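The per-bucket normalization described above is a simple ratio; the sketch below reproduces the example numbers for the 1-free-token bucket (the function name is ours):

```python
def normalize_bucket(correct_frac, incorrect_frac):
    """Normalize the correct/incorrect fractions within one free-token bucket,
    so the two normalized values sum to 1."""
    total = correct_frac + incorrect_frac
    return correct_frac / total, incorrect_frac / total

# Bucket of annotations with 1 free token: 24% of all plausible annotations
# were correct and 11% incorrect (35% of the total).
corr, incorr = normalize_bucket(0.24, 0.11)
print(round(corr, 2), round(incorr, 2))  # 0.69 0.31
```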
The bulk of the structured annotations output by SAQ-LOW (Figure 7.8) contain either none or one free token.
As the number of free tokens increases, it becomes less likely that a candidate structured annotation is correct.
SAQ-LOW penalizes a large number of free tokens and only outputs structured annotations if it is confident of their
correctness. On the other hand, for IG-5 (Figure 7.9), more than 50% of structured annotations contain at least 2
free tokens. By using the appropriate probabilistic reasoning and dynamic threshold, SAQ-LOW achieves higher
precision even against IG-0 (zero free tokens) or IG-1 (zero or one free tokens). As we can see, SAQ handles the
entire gamut of free-token presence gracefully.
Figure 7.8: SAQ-LOW: Free tokens and precision.
Overall annotation length
Figures 7.10 and 7.11 present the fraction and normalized fraction of correct and incorrect structured annotations
output, with respect to annotation length. The length of an annotation is defined as the number of the annotated
and free tokens. Note that Figure 7.11 presents results for IG-0 rather than IG-5. Having established the effect
of free tokens with IG-5, we wanted a comparison that focuses more on annotated tokens, so we chose IG-0, which
outputs zero free tokens.
Figure 7.9: IG-5: Free tokens and precision.
An interesting observation in Figure 7.10(a) is that although SAQ-LOW has not been constrained like IG-0 to
output structured annotations containing at least 2 annotated tokens, only a tiny fraction of its output annotations
contain a single annotated token. Intuitively, it is extremely hard to confidently interpret a token, corresponding
to a single attribute value, as a structured query. Most likely the keyword query is an open-world query that was
misinterpreted.
The bulk of mistakes by IG-0 happen for two-token annotations. As the number of tokens increases, it
becomes increasingly unlikely that all 3 or 4 annotated tokens from the same table appeared in the same query
by chance. Finally, note how different the distribution of structured annotations is with respect to the length of
SAQ-LOW (Figure 7.10(a)) and IG-0 (Figure 7.11(a)). By allowing free tokens in a structured annotation, SAQ
can successfully and correctly annotate longer queries, hence achieving much better recall without sacrificing
precision.
Figure 7.10: SAQ-LOW: Annotation length and precision.
Figure 7.11: IG-0: Annotation length and precision.
Types of free tokens in incorrect annotations
Free tokens can completely invalidate the interpretation of a keyword query captured by the corresponding structured
annotation. Figure 7.12 depicts a categorization of the free tokens present in plausible annotations output by
SAQ and labeled as incorrect. The goal of the experiment is to understand the source of the errors in our approach.
We distinguish four categories of free tokens: (i) Open-world altering tokens: This includes free tokens such
as "review", "drivers" that invalidate the intent behind a structured annotation and take us outside the closed
world. (ii) Closed-world altering tokens: This includes relevant tokens that are not annotated due to incomplete
structured data and eventually lead to misinterpretations. For example, token "slr" is not annotated in the query
"nikon 35 mm slr" and as a result the annotation for Camera Lenses receives a high score. (iii) Incomplete closed-
world: This includes tokens that would have been annotated if synonyms and spell checking were enabled. For
example, query "panasonic video camera" gets misinterpreted if "video" is a free token. If "video camera" were
given as a synonym of "camcorder", this would not be the case. (iv) Open-world tokens: This contains mostly
stop-words like "with", "for", etc.
The majority of errors are in category (i). We note that a large fraction of these errors could be corrected
by a small amount of supervised effort to identify common open-world altering tokens. We observe also that
the number of errors in categories (ii) and (iii) is lower for SAQ-LOW than SAQ-MED, since (a) SAQ-LOW is
more stringent in filtering annotations and (b) it down-weights the effect of free tokens and is thus hurt less by not
detecting synonyms.
Overlap on structured data
High vocabulary overlap between tables introduces a potential source of error. Table 7.1 presents a “confusion
matrix” for SAQ-LOW. Every plausible annotation in the sample is associated with two tables: the actual table
targeted by the corresponding keyword query (“row” table) and the table that the structured annotation suggests
as targeted ("column" table). Table 7.1 displays the row-normalized fraction of plausible annotations output for
each actual-predicted table pair.

Figure 7.12: Free tokens in incorrect annotations.

Actual ↓ \ Predicted →   Cameras   Camcorders   Lenses   Accessories   OLM
Cameras                    92%        2%          4%        2%         0%
Camcorders                  4%       96%          0%        0%         0%
Lenses                      2%        0%         94%        4%         0%
Accessories                13%        3%          3%       81%         0%
OLM                         7%        2%          0%        1%        90%

Table 7.1: Confusion matrix for SAQ-LOW.

For instance, for 4% of the queries relevant to table Camcorders, the plausible
structured annotation identified table Digital Cameras instead. We note that most of the mass is on the diagonal,
indicating that SAQ correctly determines the table and avoids class confusion. The biggest error occurs on camera
accessories, where failure to understand free tokens (e.g., "batteries" in query "nikon d40 camera batteries") can
result in producing high-score annotations for the Cameras table.
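Row normalization of a confusion matrix like Table 7.1 divides each (actual, predicted) count by the actual row's total. A minimal sketch, using invented counts (the real per-pair counts are not reported in the text):

```python
def row_normalize(counts):
    """Convert raw (actual -> predicted -> count) tallies into
    row-normalized fractions, as in Table 7.1."""
    normalized = {}
    for actual, row in counts.items():
        total = sum(row.values())
        normalized[actual] = {pred: c / total for pred, c in row.items()}
    return normalized

# Hypothetical counts for two of the tables.
counts = {
    "Cameras":    {"Cameras": 46, "Camcorders": 4},
    "Camcorders": {"Cameras": 2,  "Camcorders": 48},
}
norm = row_normalize(counts)
print(norm["Camcorders"]["Camcorders"])  # 0.96
```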
7.6.5 Efficiency of Annotation Process
We performed an experiment to measure the total time required by SAQ to generate and score annotations for
the queries of our full Web log. The number of tables was varied in order to quantify the effect of increasing
table collection size on annotation efficiency. The experimental results are depicted in Figure 7.13. The figure
presents the mean time required to annotate a query: approximately 1 millisecond is needed to annotate a keyword
query in the presence of 1176 structured data tables. Evidently, the additional overhead to general search-engine
query processing is minuscule, even in the presence of a large structured data collection. We also observe a
linear increase of annotation latency with respect to the number of tables. This can be attributed to the number
of structured annotations generated and considered by SAQ increasing at worst linearly with the number of
tables.
The experiment was executed on a single server and the closed structured model for all 1176 tables required
10GB of memory. It is worth noting that our solution is decomposable, ensuring high parallelism. Therefore,
besides low latency that is crucial for Web search, a production system can afford to use multiple machines to
achieve high query throughput. For example, based on a latency of 1ms per query, 3 machines would suffice for
handling a hypothetical Web search-engine workload of 250M queries per day.
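The machine-count estimate can be verified with a quick back-of-the-envelope calculation; only the 1 ms latency and the 250M-queries-per-day workload come from the text, the rest is arithmetic:

```python
import math

latency_s = 0.001        # mean annotation latency: 1 ms per query
queries_per_day = 250e6  # hypothetical Web search-engine workload
seconds_per_day = 24 * 3600

# Total seconds of annotation work per day, assuming one query at a time
# per machine; divide by the seconds available in a day and round up.
work_seconds = queries_per_day * latency_s              # 250,000 s
machines = math.ceil(work_seconds / seconds_per_day)
print(machines)  # 3
```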
[Plot: mean annotation time per query (ms) versus number of tables, with a linear fit.]
Figure 7.13: SAQ: On-line efficiency.
7.7 Comparison to Existing Work
A problem related to generating plausible structured annotations, referred to as Web query tagging, was introduced
in [98]. Its goal is to assign each query term to a specified category, roughly corresponding to a table attribute.
A Conditional Random Field (CRF) is used to capture dependencies between query words and identify the most
likely joint assignment of words to "categories". Query tagging can be viewed as a simplification of the query
annotation problem considered in this work. One major difference is that in [98] structured data are not organized
into tables. This assumption severely restricts the applicability of the solution across multiple domains, as
there is no mechanism to disambiguate between arbitrary combinations of attributes. Second, the possibility of
not attributing a word to any specific category is not considered. This assumption is incompatible with the general
Web setting. Finally, training of the CRF is performed in a semi-supervised fashion and hence the focus of [98]
is on automatically generating and utilizing training data for learning the CRF parameters. Having said that, the
scale of the Web demands an unsupervised solution; anything less will encounter issues when applied to diverse
structured domains.
Keyword search on relational [77, 101, 88], semi-structured [67, 102] and graph data [87, 72] (Keyword Search
Over Structured Data, abbreviated as KSOSD) has been an extremely active research topic. Its goal is the efficient
retrieval of relevant database tuples, XML sub-trees or subgraphs in response to keyword queries. The problem
is challenging since the relevant pieces of information needed to assemble answers are assumed to be scattered
across relational tables, graph nodes, etc. Essentially, KSOSD techniques allow users to formulate complicated
join queries against a database using keywords. The tuples returned are ranked based on the “distance” in the
database of the fragments joined to produce a tuple, and the textual similarity of the fragments to query terms.
The assumptions, requirements and end-goal of KSOSD are radically different from the Web query annota-
tion problem that we consider. Most importantly, KSOSD solutions implicitly assume that users are aware of
the presence and nature of the underlying data collection, although perhaps not its exact schema, and that they
explicitly intend to query it. Hence, the focus is on the assembly, retrieval and ranking of relevant results (tuples).
On the contrary, Web users are oblivious to the existence of the underlying data collection and their queries might
even be irrelevant to it. Therefore, the focus of the query annotation process is on discovering latent structure in
Web queries and identifying plausible user intent. This information can subsequently be utilized for the benefit of
structured data retrieval and KSOSD techniques. For a thorough survey of the KSOSD literature and additional
references see [40].
7.8 Conclusions
Integrating structured data into Web search presents unique and formidable challenges, with respect to both result
quality and efficiency. Towards addressing such problems we defined the novel notion of Structured Annotations
as a mapping of a query to a table and its attributes. We showed an efficient process that creates all such annotations
and presented a probabilistic scorer that has the ability to sort and filter annotations based on the likelihood
they represent meaningful interpretations of the user query. The end-to-end solution is highly efficient, demonstrates
attractive precision/recall characteristics and is capable of adapting to diverse structured data collections
and query workloads in a completely unsupervised fashion.
Chapter 8
Conclusions
Increasingly, research on textual data management adopts a view of documents that is aligned with their true
complexity: extended documents comprised of both text and meta-data, and document collections that are dynamic and
integrated (Chapters 1 and 2). In this context, we presented a gamut of efficient solutions enabling sophisticated,
novel functionality for interacting with textual data; techniques that leverage the extended nature of documents
and document collections. It is our hope that the material presented in this thesis will help stimulate both new
research work, as well as inspire the development of useful real-world applications.
As a concluding attempt towards this goal, we summarize in Section 8.1 the most important – and hopefully
useful – lessons we gained while working on this thesis. In Section 8.2 we discuss the potential utility of our
ideas, techniques and algorithms in applications other than the ones they were originally meant to support. In Section 8.3
we suggest possible research directions extending the ideas presented herein.
Finally, we note that besides documenting our techniques and results, we were actively involved in the prototyping
of applications based on them. Grapevine1 [9] is an on-line application that allows its users to interactively
explore stories capturing attention in social media. At the time of writing the system processes 2.5 million blog
posts daily and tracks the stories being discussed across 660 thousand demographic segments. The algorithms
powering the system and enabling its functionality are the ones described in Chapter 3. Additionally, the Web
query analysis technique presented in Chapter 7 is incorporated into Microsoft’s Helix project2 [122], an ongoing
effort on integrating structured data sources into Web search. Similarly, we are confident that all solutions pre-
sented are equally suitable for real applications, as corroborated by the real-data experimental results presented in