QUERYING, EXPLORING AND MINING THE EXTENDED DOCUMENT

by

Nikolaos Sarkas

A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy

Graduate Department of Computer Science
University of Toronto

Copyright © 2011 by Nikolaos Sarkas
Collections of digital documents can be used to support information and knowledge discovery applications of
great diversity and richness. Roughly, the functionality built on top of text corpora can be classified into querying, exploration and mining.
Querying: The archetypal mode of interaction with a document collection is through querying. In many applica-
tion scenarios the user has a well defined need for a piece of information. Querying a document collection
involves expressing this information need and satisfying it by locating and ranking a handful of hopefully
relevant documents. For instance, in Web search users express their information need in the form of a
keyword query and as a response receive a ranked list of Web pages deemed to contain the information
requested.
Exploration: An information need is not always precise. Pushing this observation to the extreme, it can be
completely absent and, hence, speculative access to the document collection is required. In such scenar-
ios querying is by definition an ineffective mode of interaction. Exploratory applications compensate by
providing views to the document collection at different levels of granularity and facilitating the progressive
refinement and crystallization of the loosely-defined information need. As an example, consider a hierar-
chical clustering of the documents [34]. This organization allows a user to navigate the cluster hierarchy, at each iteration refining or broadening the scope of his view as feedback about the thematic structure of the
collection is received.
Mining: Document collections can also support a wide range of information extraction, pattern and knowledge
discovery tasks, usually domain specific. We use the term mining in its broadest sense1 to refer to these applications. As such, an example of a mining application is Information Extraction [10, 5, 58], whose aim is to extract useful, structured information from documents, in the form of entities and their relationships mentioned in text.
The boundaries between these different modes of functionality are fuzzy. Querying and exploration complement
each other, while both can be improved and supported by utilizing the output of text mining tasks.
1.2 The Extended Document, Dynamic and Integrated Collections
Applications originally adopted the view that documents are by definition equal to their textual component. It was
not uncommon for even the textual structure to be ignored, in which case a document was viewed as the multi-set
of the words comprising it. This approach is far from unreasonable and has been the basis for significant progress
on querying, exploring and mining textual data [34, 106]. However, it has become incompatible with the nature,
complexity and quantity of textual content generated on-line at a torrential pace.
The evolution of the Web into an interactive medium that promotes and encourages active user engagement
and co-operation has ignited a huge increase in the amount of available textual data: massive volumes are continuously posted on-line in the context of blogs2, micro-blogs3 and social networks4, customer feedback portals5,
etc. Such documents are naturally associated with a wealth of information in addition to their textual component,
while advances in Information Extraction and Natural Language Processing technology help highlight additional
document characteristics.
Example 1.1. As a first illuminating example consider the Blogosphere. Millions of bloggers author on a daily
basis in excess of 2.5 million blog posts, where they comment upon a wide variety of private matters, but also
products, people and public on-going events. Powerful Information Extraction machinery [10, 5, 58] can expose
entities mentioned in the text and their relationships.
The collective “discussion” on the Blogosphere takes place in a semi-anonymous manner, as bloggers typically reveal their demographic profile (age, gender, occupation, location). Novel tools also allow us to infer
additional information about bloggers, such as their standing among their peers [96, 109]. Hence, blog posts can
be associated with the demographic information of their authors.
Finally, blog posts also link to each other, providing us with knowledge about relationships between posts or
1. Broader than the scope of data mining, which is usually exclusively associated with pattern discovery.
2. www.blogspot.com
3. www.twitter.com
4. www.facebook.com
5. www.epinions.com
their authors and allowing us to understand how information is diffused in this massive, global network.
Example 1.2. On-line forums such as customer feedback portals offer unique opportunities for individuals to
engage with sellers or other customers and provide their comments and experiences. These interactions are
typically summarized by the assignment of a numerical or “star” rating to a product or the quality of a service.
Numerous such applications exist, like Amazon’s customer feedback and Epinions6. But even if ratings are not
explicitly provided, sentiment analysis tools [145, 120, 119] can identify with a high degree of confidence the
governing sentiment (negative, neutral or positive) expressed in a piece of text, which in turn can be translated
into a numerical rating.
Example 1.3. Users of on-line, collaborative tagging systems add to their personal collection documents such as
Web pages, blog posts, scientific publications, etc., and associate with each of them a short sequence of keywords
widely known as tags. Each tag sequence, referred to as an assignment, is a concise and accurate summary of the
relevant resource’s content according to the user’s opinion. Given the overlap among the individual collections,
documents accumulate a large number of assignments, each one of them posted by a different individual.
Example 1.4. Micro-blogging services and on-line social networks offer a remarkably similar model of interaction – users post short text messages or URLs which are pushed to their friends or followers. The diversity of non-textual information associated with a piece of text one or two sentences long is impressive: the exact time of
authorship, links to other web pages or messages, author demographic information, the underlying social network
of the sender, sentiment, entities mentioned, even the precise location of the author when a mobile device was
used to publish it.
The wealth of non-textual information associated with content published on-line is redefining the way we
utilize and interact with documents. The value of such meta-data is not just a premise. Their use has already given birth to novel and improved applications. There is no need to conjure up esoteric examples of such advances. Consider how the hyperlink revolutionized Web search and mining. Its creative use has enabled improved
Web search quality [31], personalized Web search [84], the ability to identify Web communities [94], study the
diffusion of information [1], and much more (Chapter 2).
In a similar manner, documents published on-line and associated with tags, entities, sentiment, author infor-
mation, etc., are being queried, explored and mined in ways and accuracy not possible for vanilla documents.
Meta-data are not simply ancillary information to the document’s textual component, but an integral part of what
the document is and the information it carries. This renewed view of the document as a datum that does not begin
and end with its textual component, but is extended to include as equally valuable and important the meta-data
associated with it, gives rise to the concept of the extended document.
6. www.epinions.com
The same developments that gave birth to the extended document have also rendered obsolete the view of
document collections as immutable and isolated from other valuable data collections, textual or structured.
Example 1.5. At the time of writing, people around the world author on a daily basis in excess of 2.5 million blog posts (30/sec), 55 million micro-blog posts (tweets, 630/sec), and 900 million social status messages (10,400/sec). New content is generated in a never-ending, streaming fashion, continuously expanding and enriching the corresponding document collections. Additionally, the focus and characteristics of newly generated content are constantly evolving, reflecting and shaping events, public opinion and memes7.
Example 1.6. Data sources publicly available on-line exhibit great diversity. The Web is evolving from a col-
lection of web pages to a federation of document collections, each one with unique characteristics. Web pages,
blog posts, tweets, social status messages, wikis, forums, user reviews, etc., form conceptually distinct document
collections. However, they do not exist in isolation from each other: they co-exist, co-evolve and interact. Besides
textual data, structured data are becoming an integral part of the Web, whether they lie in public “deep web”
databases [23] or correspond to knowledge about entities and their relationships extracted from Web textual data.
As the above examples illustrate, document collections are far from static and isolated. They are “alive”;
growing and evolving; co-existing and interacting. They are dynamic and integrated. An impressive gamut of
novel applications for querying, exploring and mining textual data is enabled by utilizing the evolving nature of
document collections and leveraging the synergies between different data collections.
One could even argue that utilizing these emerging characteristics of documents and document collections
is not simply an opportunity, but rather the only possible way forward. The amount and diversity of textual
data available on-line is immense and the pace at which it is generated is accelerating, as more people become
comfortable with participating in on-line activities. The content's focus and characteristics change as fast as the
world around us and novel applications continuously join the existing ecosystem and generate new types of textual
content and meta-data.
It is not clear that textual data at this scale, complexity and rate of change can be interacted with and made
sense of by ignoring everything but their textual component. Would searching for information within billions
of Web pages be possible without the help of link analysis? Similarly, can search and exploration of billions of
blogs, tweets, social status messages, user reviews, etc., be possible without utilizing notions of time, social or demographic proximity, and distillations such as entities and sentiment? Is effective search for information and
knowledge possible if we only focus on a thin slice of the available data at a time?
The growth of on-line user participation, with all its economic and social benefits, is silently supported by
techniques and applications that intermediate and connect users producing and consuming information. Without
7. A postulated unit of cultural ideas, symbols or practices [49].
effective solutions for searching, exploring and mining textual data, this continuing growth is far from granted.
For all the above reasons it is becoming essential to unlock the information contained in meta-data; to develop
techniques that handle rapidly changing corpora; to push the boundaries of the data sets utilized by our applica-
tions.
Nevertheless, developing successful applications is not a straightforward enterprise, as we are presented with formidable challenges. We can identify two main sources of complexity: one is conceptual and one is computational.
First, utilizing information about documents in addition to their textual component introduces increased con-
ceptual complexity. Querying, exploring and mining extended documents demands that we reason about the
relations and dependencies among text and meta-data. The same is true of integrated document collections. The
relations among diverse types of data are rarely straightforward and deterministic. The challenge is compounded
by the great variance in the data attributed to the fact that content is increasingly the result of social user activity
rather than professional authorship.
Second, the size of available document collections is immense, while new textual content is being generated at
a staggering pace. Yet, applications such as querying and exploration demand interactive response times. Mining tasks also need to be highly efficient and, where possible, amenable to incremental computation, since the dynamic
nature of document collections requires their frequent application.
1.3 Contributions and Outline
Motivated by these developments, the present thesis contributes to a growing body of work that seeks to develop
novel and improved techniques for querying, exploring and mining textual data by exploiting the extended nature
of documents and document collections. We present five solutions for interactively querying and exploring document collections. Although we do not present a stand-alone mining application, the solutions presented will be frequently supported by data mining tasks.
The contributed techniques are “distinct” from each other in the following sense. Note that by the very notion
of the extended document, a technique cannot be general enough to be applicable to any type of document. Each
approach needs to focus on creatively and efficiently utilizing different types of meta-data, available in different
application domains. This is also true for integrated document collections. Different approaches might be needed
based on the nature of co-existing data collections and the task at hand.
This observation holds true for the solutions to be presented: each one is applicable to a different class of
extended documents. Nevertheless, the fact that they are not, and cannot be, completely general, does not imply
that their scope is limited. For instance, we discuss (Chapter 3) an approach for interactively exploring a collection
of extended documents whose associated meta-data are entities extracted from text and categorical attributes.
While the presence of entities and categorical attributes is a prerequisite for applying the technique, it is applicable
to all extended document collections with these characteristics.
In addition, our contributions share two core traits. First, they employ to the extent possible principled probabilistic reasoning. Previously, we identified the complexity of on-line content as a major challenge in developing successful solutions. The use of probabilistic reasoning, rather than ad-hoc heuristics, in explicitly or implicitly capturing the user behavior and activities giving rise to this complexity, allows us to effectively address it. Second, they were designed with the explicit goal of being highly efficient and applicable to vast, real-world data
collections. We invest considerable effort in validating this claim, by presenting extensive experimental results on
real sets of data.
More specifically, with respect to exploring document collections, we make the following contributions. Part
of this work also appears in [130, 9, 131].
Extended Documents with Entity Mention and Categorical Meta-data [130, 9]
The vast amount and great diversity of user generated content published on-line necessitates novel paradigms for
its understanding and exploration. To this end, we introduce an efficient methodology for discovering strong entity associations within all the slices (categorical meta-data value restrictions) of a document collection. Since related
documents mention approximately the same group of core entities (people, locations, companies, products, etc.),
the entity associations discovered can be used to expose underlying themes within each slice of the document
collection. This and other relevant information can be interactively presented on demand as one “drills-down” to
a slice of interest, or compared for different slices.
We devise efficient algorithms capable of addressing two flavors of the core problem: algorithm THR-ENT for
computing all sufficiently strong entity associations (Threshold Variation) and algorithm TOP-ENT for computing
the top-k strongest entity associations (Top-k Variation), within each slice of the extended document collection.
Algorithm THR-ENT eliminates from consideration provably weak associations, while TOP-ENT supports early
termination. A unique characteristic of algorithms THR-ENT and TOP-ENT is their ability to accommodate any
plausible alternative for quantifying the degree of association between entities. This trait enables the use of complex but robust statistical correlation measures that are most appropriate for text mining tasks. Finally, the
application of the algorithms and their variations on all the slices of the collection is supported by an efficient and
nimble infrastructure that exploits slice overlap.
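To make the notion of a statistical correlation measure concrete, the following sketch scores entity pairs within a single slice by the chi-square statistic of their co-occurrence. It is an illustration only: the entity names and documents are invented, and the actual THR-ENT and TOP-ENT algorithms add the pruning and slice-sharing machinery described above instead of this naive enumeration.

```python
from collections import Counter
from itertools import combinations

def association_scores(slice_docs):
    """Chi-square co-occurrence score for every entity pair in a slice.

    slice_docs: list of sets, each the entities mentioned in one document."""
    n = len(slice_docs)
    entity_count = Counter()
    pair_count = Counter()
    for entities in slice_docs:
        entity_count.update(entities)
        pair_count.update(combinations(sorted(entities), 2))
    scores = {}
    for (a, b), observed in pair_count.items():
        expected = entity_count[a] * entity_count[b] / n  # under independence
        scores[(a, b)] = (observed - expected) ** 2 / expected
    return scores

# Invented toy slice: four documents with their extracted entities.
docs = [{"Obama", "Toronto"}, {"Obama", "Toronto"}, {"Obama"}, {"Toronto", "Leafs"}]
scores = association_scores(docs)
```

Any other measure (log-likelihood ratio, mutual information) could replace the chi-square formula in the loop, which is precisely the flexibility the algorithms are designed to preserve.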
The efficiency and applicability of the proposed techniques to massive document collections is demonstrated by means of a thorough experimental evaluation that employs synthetically generated and real world data, comprised of millions of documents. The utility in this context of statistical correlation measures, as was previously
observed in the literature, is verified through the public implementation of the proposed functionality [9] and a
smaller scale demonstration on real data.
Extended Documents with User Rating/Sentiment Meta-data [131]
In the context of on-line social activity, it is common for users to express their views and opinions on products,
services, etc., in reviews both formal (retailer or specialized Web sites) and informal (blogs, forums). These
interactions are typically summarized with a numerical or “star” rating, which can be explicitly provided by the
user or detected by means of a sentiment analysis tool. Hence, user reviews can be viewed as extended documents
associated with a numerical rating from a small domain.
The information contained in reviews is invaluable to other users, but plain keyword search is an ineffective approach for interacting with, exploring and digesting it. We enable richer interaction by supporting the progressive refinement of a query result in a data-driven manner, through the suggestion of expansions of the query with additional keywords. The suggested expansions allow one to interactively explore the reviews, by focusing on a
particularly interesting subset of the original result set. Examples of such refinements are reviews discussing a
certain feature of the product described in the original query, or subsets of reviews with ratings that are high or low on average.
To offer this novel functionality at interactive speeds, we introduce a framework that is computationally efficient and nimble in terms of storage requirements. For a user query, the “interestingness” of each candidate expansion is quantified by means of a fairly general scoring function, instantiated to produce expansions with the desired characteristics. The top-k highest scoring ones are presented. Our solution utilizes the principle of Maximum Entropy to efficiently estimate the score of a candidate expansion, instead of performing a demanding exact computation. It is further improved by utilizing Convex Optimization principles that allow us to exploit the pruning opportunities offered by the natural top-k formulation of the problem.
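The estimation idea can be illustrated at a toy scale. Suppose that, instead of storing the full joint distribution over query presence q, candidate word presence w and positive rating r, we store only the three pairwise marginals. Iterative proportional fitting then recovers the unique maximum-entropy joint consistent with them, from which quantities such as the share of positive ratings among reviews matching both q and w can be estimated. This sketches the principle only; the variables and marginals below are invented, and the thesis's framework operates at a very different scale and with the convex-optimization pruning described above.

```python
def maxent_joint(p_qw, p_wr, p_qr, iters=500):
    """Maximum-entropy joint over three binary variables (q, w, r),
    fitted to the three given 2x2 pairwise marginals by iterative
    proportional fitting (IPF)."""
    p = [[[1 / 8] * 2 for _ in range(2)] for _ in range(2)]  # uniform start
    for _ in range(iters):
        for q in range(2):                       # match the (q, w) marginal
            for w in range(2):
                m = p[q][w][0] + p[q][w][1]
                for r in range(2):
                    p[q][w][r] *= p_qw[q][w] / m
        for w in range(2):                       # match the (w, r) marginal
            for r in range(2):
                m = p[0][w][r] + p[1][w][r]
                for q in range(2):
                    p[q][w][r] *= p_wr[w][r] / m
        for q in range(2):                       # match the (q, r) marginal
            for r in range(2):
                m = p[q][0][r] + p[q][1][r]
                for w in range(2):
                    p[q][w][r] *= p_qr[q][r] / m
    return p

# Invented, mutually consistent pairwise marginals.
p_qw = [[0.20, 0.10], [0.15, 0.55]]
p_wr = [[0.15, 0.20], [0.10, 0.55]]
p_qr = [[0.15, 0.15], [0.10, 0.60]]
p = maxent_joint(p_qw, p_wr, p_qr)
# Estimated share of positive ratings among reviews matching q and w:
est = p[1][1][1] / (p[1][1][0] + p[1][1][1])
```

The fitted joint matches every stored marginal while committing to nothing else, which is exactly the behavior one wants when the exact joint counts are too expensive to store or compute per query.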
Performance is evaluated using both synthetic data and large real data sets comprised of blog posts. Our
results indicate that the improvement gains are such that the application of our solution in an interactive scenario
is feasible.
With respect to querying document collections, we make the following contributions. Part of this work also
appears in [132, 133, 134].
Extended Documents with Tag Meta-data [132]
Through a number of on-line applications, users have the ability to associate documents with sequences of descriptive keywords, referred to as tags. Each tag sequence is a concentrated description of the document's content
according to a user. For each document we can have a significant number of such individual opinions. This information is extremely valuable for a number of applications, including search. Intuitively, Information Retrieval algorithms attempt to automatically identify the keywords that are relevant to a document. With tags, we have a large number of human users collaborating to generate this information.
Previous work on searching socially annotated document collections presented ad-hoc approaches to utilizing
tags. Such approaches fail to leverage our growing understanding of the structure and dynamics of the tagging process. Efficiency issues were also left unaddressed. Instead, we introduce a principled probabilistic methodology for determining the relevance of a query to a document's tags. Our solution utilizes interpolated n-grams to model the tag sequences associated with each document. The use of interpolated n-gram models exposes significant and highly informative tag co-occurrence patterns (correlations) present in the user assignments. The training and incremental maintenance of the interpolated n-gram models is performed by means of a novel constrained optimization framework that employs powerful numerical optimization techniques and exploits the unique properties of both the function to be optimized and the parameter domain.
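As a simplified sketch of the modeling idea (with invented tag data, and without the constrained-optimization training of the interpolation weights, which are fixed by hand here), a document's tag assignments can back a bigram model whose per-term probabilities interpolate bigram and unigram estimates:

```python
from collections import Counter

def train_tag_model(assignments):
    """Unigram and bigram counts over a document's tag assignments."""
    uni, bi = Counter(), Counter()
    for tags in assignments:
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
    return uni, bi

def query_likelihood(query, uni, bi, lam=0.7, eps=1e-6):
    """Interpolated bigram/unigram likelihood of a query under the model."""
    total = sum(uni.values())
    score, prev = 1.0, None
    for t in query:
        p_uni = uni[t] / total if total else 0.0
        if prev is None:
            p = p_uni                             # no history for the first term
        else:
            p_bi = bi[(prev, t)] / uni[prev] if uni[prev] else 0.0
            p = lam * p_bi + (1 - lam) * p_uni    # the interpolation step
        score *= p + eps                          # eps smooths unseen terms
        prev = t
    return score

# Invented assignments for one document, posted by three different users.
assignments = [["python", "tutorial"], ["python", "tutorial", "web"], ["web", "design"]]
uni, bi = train_tag_model(assignments)
```

The co-occurrence pattern python → tutorial makes the query ["python", "tutorial"] far more likely under this document's model than ["python", "design"], even though all of these tags occur individually.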
We orchestrated a large scale experimental evaluation involving a large crawl of the most popular social
annotation application and tens of independent human judges. Our results demonstrate significant improvement
in both retrieval precision (+30%) and optimization efficiency (4×).
Dynamic Collection of Extended Documents with Partially-Ordered Categorical Meta-data [133]
The skyline of an extended document collection annotated with categorical attributes can be used to single out documents that possess a uniquely interesting combination of attribute values that no other document can match,
by having more preferable values in all its attributes. The subsequent uses of the skyline documents are many and
depend on the application. In addition, the skyline is a valuable concept not only for static document collections,
but also, perhaps more so, for streaming, dynamic document collections.
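The dominance relation under partially ordered categorical attributes can be sketched as follows (a naive quadratic skyline over invented attributes; the contribution of this work is the incremental maintenance machinery, not this baseline). Each attribute carries a strict "better-than" relation, assumed already transitively closed, and values absent from it are mutually incomparable:

```python
def dominates(a, b, prefer):
    """True iff document a dominates b: on every attribute a's value is
    equal or strictly preferred, and strictly preferred at least once."""
    strict = False
    for attr, better_than in prefer.items():
        va, vb = a[attr], b[attr]
        if va == vb:
            continue
        if vb in better_than.get(va, set()):
            strict = True        # a is strictly better on this attribute
        else:
            return False         # b is better, or the values are incomparable
    return strict

def skyline(docs, prefer):
    """Naive O(n^2) skyline: documents dominated by no other document."""
    return [d for d in docs
            if not any(dominates(o, d, prefer) for o in docs if o is not d)]

# Invented partial orders: "unknown" is incomparable to the other licenses.
prefer = {"quality": {"high": {"medium", "low"}, "medium": {"low"}},
          "license": {"open": {"restricted"}}}
docs = [{"quality": "high", "license": "open"},
        {"quality": "medium", "license": "open"},
        {"quality": "low", "license": "unknown"}]
sky = skyline(docs, prefer)
```

The first document dominates the second, but the third survives: its "unknown" license is incomparable to "open", so no document beats it on every attribute — exactly the complication that partial orders introduce into the comparison of two documents.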
There are two main challenges in computing the skyline of a dynamic collection. First, recomputing the skyline from scratch each time a new document arrives or an old one expires is wasteful. Instead, the skyline needs to be incrementally maintained after such updates. Second, extended document meta-data are typically categorical in nature and, hence, preferences can define a partial ordering of their domain. This flexibility complicates the “comparison” of two documents' values.
In order to address these challenges, we adopt a generic skyline maintenance framework and design its two
building blocks: A grid-based data structure for indexing the most recent documents of the collection and a data
structure based on geometric arrangements to index the current skyline documents.
Our experimental evaluation reveals one order of magnitude performance improvement over the adaptation of
a competing alternative that was originally developed for static data, as well as improved scalability in terms of
the number of extended document attributes.
Document Collection Integrated with Structured Data Sources [134]
Search engines are increasingly utilizing diverse sources of information, in addition to their Web page index, in order to better serve user queries. One such source is structured data collections, which we can abstract as
relational tables. Results in response to keyword queries related to products (“50 inch samsung lcd”), movie
showtimes (“inception toronto”), airlines schedules (“morning toronto to cuba flights”), etc., are augmented by
presenting information directly retrieved from structured data tables (e.g., entries from tables TVs, Movies and
Flights respectively for the above examples).
In order to integrate structured data tables into Web search in this manner, each query needs to be analyzed so that highly plausible mappings of the query to a table and its attributes are identified. The challenges in performing this analysis are twofold. First, a search engine can maintain thousands of structured data tables that are implicitly
and indirectly queried by users that demand response times in the order of milliseconds. The Web query analysis
overhead needs to be minuscule. Second, the intent behind free-text Web queries is rarely clear and unambiguous.
Determining which, if any, tables and their attributes are relevant to a query – with high precision and recall – is
hard.
To address these challenges we introduce a fast and scalable mechanism for obtaining all possible mappings
of a query to the structured data tables. A probabilistic scoring mechanism subsequently estimates the likelihood
of the user intent captured by each mapping and deploys a dynamic, query-specific threshold to eliminate highly
unlikely mappings. The probability estimates utilized in these computations are mined off-line from the structured
and query log data. The techniques are completely unsupervised, obviating the need for costly manual labeling
effort.
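A drastically simplified sketch of the mapping step follows; the table names, vocabularies and probabilities are all invented, and the real system enumerates all mappings, scores them with a probabilistic model mined from query-log and table statistics, and applies a dynamic, query-specific threshold rather than the fixed one used here.

```python
def map_query(tokens, tables, threshold=1e-4):
    """Attach each query token to the first attribute whose value
    vocabulary contains it; a table mapping survives only if every token
    is covered and the product of mined token probabilities clears the
    threshold."""
    results = []
    for name, attrs in tables.items():
        mapping, score = {}, 1.0
        for tok in tokens:
            hit = next(((a, p[tok]) for a, p in attrs.items() if tok in p), None)
            if hit is None:
                break                    # token unexplained by this table
            mapping[tok] = hit[0]
            score *= hit[1]
        else:
            if score >= threshold:
                results.append((name, mapping, score))
    return sorted(results, key=lambda r: -r[2])

# Invented off-line statistics: P(token | attribute) for two tables.
tables = {
    "TVs": {"brand": {"samsung": 0.3, "sony": 0.2},
            "size": {"50": 0.1, "inch": 0.2},
            "type": {"lcd": 0.4, "led": 0.3}},
    "Movies": {"title": {"inception": 0.05}, "city": {"toronto": 0.1}},
}
mappings = map_query(["samsung", "lcd"], tables)
```

Here the query maps to the TVs table only; "samsung" has no interpretation in Movies, so that candidate is abandoned after a single token — the kind of early pruning that keeps the per-query overhead small.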
The effectiveness and efficiency of our techniques are evaluated using real world queries and data. Overall, our approach offers high precision (90%) with good recall (40%), while imposing on average sub-millisecond overhead even in the presence of 1000 structured data tables.
Outline
The remainder of the thesis is organized as follows. In Chapter 2 we review work on querying/exploring/mining
extended documents, dynamic and integrated document collections, and explore in more detail the background
on which the functionality subsequently presented lies. Part I of the thesis follows (Chapters 3 and 4) where
we introduce our techniques for exploring document collections. Subsequently, Part II (Chapters 5, 6 and 7)
introduces our querying techniques. We offer our closing thoughts in Chapter 8.
Chapter 2
Related Work
The techniques that we subsequently present are part of a growing body of work that seeks to improve querying,
exploration and mining of extended documents, dynamic and integrated document collections. In this Chap-
ter we review part of this activity with the goal of further clarifying the context in which this thesis’ technical
contributions belong.
Most of the existing work on processing extended documents is typically built around a single type of meta-
data. Hence, we choose to structure our presentation of it around the type of meta-data used: links (Section 2.1),
categorical attributes (Section 2.2), tags (Section 2.3), extracted entities (Section 2.4) and user feedback (Section
2.5). In addition, we review work on dynamic (Section 2.6) and integrated (Section 2.7) document collections.
2.1 Link Analysis and Social Networks
One of the first pieces of non-textual information associated with documents to be successfully utilized are hyperlinks, or simply links. Links between documents can be explicit, such as hyperlinks connecting two Web pages, or implicit when document sources are connected instead, as in the case of content authored in the context of on-line
social networks.
The first attempts to leverage the elaborate link structure among documents were in the context of Web search
and ranking. The PageRank [31] and HITS [90] algorithms developed in this context are among the most success-
ful. The premise behind utilizing intra-document links is that the relevance of a keyword query to a document’s
textual component cannot be the sole guide in determining the top results to be presented in response to the query,
since a significant number of documents can be highly relevant. Instead, the authority or importance of a Web page should also be an important factor. As Kleinberg [90] argues, a link from a web page p to a web page q confers, in some measure, authority on q. Hence, links can be used in order to automatically infer a document's
importance.
The PageRank algorithm maps the document collection into a graph: documents comprise the graph's nodes, while directed edges among nodes are created whenever a link exists between the corresponding documents. The intuition behind the algorithm is that highly authoritative documents should be heavily linked by other highly authoritative documents. This recursive definition of authority conferred by means of links enables an elegant approach for computing the global authority of documents. The document graph is viewed as a Markov chain and authority propagation is equivalent to a traversal of this Markov chain: frequently visited chain states (authoritative documents) are linked by other frequently visited states (other authoritative documents). Hence, the steady state distribution of the chain can be used as a proxy of the corresponding document authority.
The profound success of the PageRank algorithm led to the development of numerous variants. While PageRank computes the global importance score of a document, Topic-Sensitive PageRank [71] computes a document's importance within a subset of documents determined to be relevant to a particular topic. On the other hand, Personalized PageRank [84] computes user-specific importance scores. Both algorithms utilize the Markov chain formulation of the PageRank computation problem. They bias the steady state distribution of the chain towards the desired subset of states (documents) by adding a biased “teleportation” component to the Markov chain: at each step, with probability a, a state outlink is used to continue the chain traversal, while with probability 1 − a, a jump towards the biased states occurs.
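Both the uniform and the biased variants fit in a few lines of power iteration. The sketch below (graph and parameter values invented) follows an outlink with probability alpha and teleports with probability 1 − alpha, either uniformly (plain PageRank) or according to a supplied bias vector (the Topic-Sensitive and Personalized variants):

```python
def pagerank(links, alpha=0.85, bias=None, iters=50):
    """Power iteration for (optionally personalized) PageRank: with
    probability alpha follow an outlink, with probability 1 - alpha
    teleport according to `bias` (uniform when bias is None)."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    tele = bias or {u: 1 / n for u in nodes}
    rank = {u: 1 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - alpha) * tele.get(u, 0.0) for u in nodes}
        for u in nodes:
            outs = links.get(u) or nodes      # dangling node: spread everywhere
            share = alpha * rank[u] / len(outs)
            for v in outs:
                nxt[v] += share
        rank = nxt
    return rank

# Invented three-page graph: b is linked by both a and c.
links = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
r = pagerank(links)
```

Passing, say, bias={"a": 1.0} concentrates the teleportation on page a, inflating the scores of a and its neighborhood — the mechanism shared by the topic-sensitive and personalized variants.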
Besides inferring document importance, intra-document links have been used to combat web spam. Given
that search engines utilize link structure in order to assess the importance of web pages, spammers have an
incentive to create clusters of pages densely linking each other. Due to the mechanics underlying the PageRank
and HITS algorithms, undesirable pages in such clusters would receive high importance scores. [68] presents
an algorithm facilitating the discovery of such link spam. Their algorithm uses the importance scores computed
by two applications of the PageRank algorithm. First, PageRank is normally applied. Then, it is reapplied but
this time using a random teleportation component biased towards a seed of pages that are known not to be spam.
Since good pages rarely link to spam pages, the second application of PageRank should be biased towards good
pages. Pages that exhibit considerable divergence among the scores produced by the two PageRank applications
are candidates for being spam.
Web community detection is a text mining application made possible by link analysis. By utilizing the links
between user-generated content (blogs, tweets, etc.), communities of authors can be discovered. [94] is a pioneering attempt to organize Web pages into communities. Based on the observation that communities should include a core, i.e., a full bipartite graph of hub pages linking to authoritative pages, an efficient algorithm is developed
that hunts for such cores. In a sense, community detection can be viewed as a highly focused document clustering
technique, capable of identifying focused clusters of documents based on their linking patterns instead of their
CHAPTER 2. RELATED WORK 12
textual similarity. Among the possible applications enabled by community detection is exploration, since the
identified communities can serve as an entry point for exploring the Web.
Identification of "influential" individuals is an important task with numerous applications, including querying.
For instance, Mathioudakis and Koudas [109] identify query-dependent influential blogs, referred to as starters.
Given a query, e.g., "politics", their goal is to identify blogs whose posts relevant to "politics" receive a significantly higher number of inlinks than the number of outlinks that they contain. Hence, starters are influential blogs with respect to the particular topic specified by the query. The efficient identification of
query-specific starter blogs is supported by sampling the blog graph inferred by the posts relevant to the query
and their links. The proposed technique allows the approximate computation of the most influential blogs without
processing the entire graph.
Sociology-inspired approaches for studying “influence” require the explicit knowledge of an underlying social
network. For instance, [89] studies the problem of identifying the influential individuals that should be targeted
in order to maximize the spread of an “idea” in the social network. However, information about the underlying
social network is not always available. In such scenarios, linking patterns between user-generated content (e.g.,
blog posts) can help us make inferences about how influential individuals or web sites are among their peers.
Similarly, [1] attempt to reconstruct the underlying "influence network" between blogs. While information spreads through this implied network, evidence of "infection" is not always available. For example, while a blogger might first view a YouTube video on a different blog and re-publish it on his own, he will not always provide a link to the blog where he first encountered this information. In order to identify such implicit information pathways between blogs, [1] train a number of classifiers that are used to determine whether an information link exists between two blogs.
2.2 Faceted Search
A widely deployed technique that utilizes hierarchical categorical meta-data in order to support enhanced querying and navigation of textual data collections is Faceted Search, proposed by Pollit [124] and subsequently by Yee et al. [154]. The faceted search model, which adds exploratory capabilities to the plain keyword search model, assumes extended documents whose meta-data are orthogonal attributes with hierarchical domains, referred to as facets. In their presence, the result of a vanilla keyword query can be refined by "drilling down" the facet hierarchies. This interactive process places and gradually tightens constraints on the attributes, allowing one to identify and concentrate on a fraction of the documents that satisfy a keyword query. This slice of the original result set possesses properties that are considered interesting, expressed as constraints on the document meta-data attributes.
Perhaps the most well known application of faceted search is on-line stores such as Amazon: a user keyword query on the store's product database can potentially retrieve thousands of product pages (extended documents) that can be refined by Product Type (e.g., Book, Movie), Price, etc. In this context, attributes Product Type and Price are two independent facets, whose domains are organized in hierarchies (e.g., Product Type ⇒ Book ⇒ Fiction), that can be traversed in order to refine the original query result.
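The drill-down mechanics can be sketched as follows; the catalog, facet names and hierarchy paths below are hypothetical:

```python
# Each document stores, per facet, its full path down the facet hierarchy.
products = [
    {"title": "Dune", "Product Type": ("Book", "Fiction"), "Price": ("0-20",)},
    {"title": "Linear Algebra", "Product Type": ("Book", "Textbook"), "Price": ("20-50",)},
    {"title": "Alien", "Product Type": ("Movie",), "Price": ("0-20",)},
]

def refine(docs, facet, prefix):
    """Keep documents whose path for `facet` starts with `prefix`."""
    prefix = tuple(prefix)
    return [d for d in docs if d[facet][:len(prefix)] == prefix]

books = refine(products, "Product Type", ["Book"])            # drill down once
fiction = refine(books, "Product Type", ["Book", "Fiction"])  # tighten further
```

Each successive `refine` call corresponds to one drill-down step, progressively narrowing the keyword query's result set.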
The facet domains and their hierarchical organization can be meta-data naturally associated with the documents, can be set manually by an expert, or can be automatically extracted from the document collection at indexing time. The automatic extraction of multiple orthogonal facets and their associated hierarchies from the document collection is the focus of work conducted by Dakka et al. [47, 46].
In [47] a limited, supervised approach for extracting facets and associating documents with them is presented. The approach assumes a training set of well-defined facets and words associated with them (e.g., facet "Animals" and words "cat", "dog"). Each document is processed and a set of descriptive keywords (nouns) is extracted from it. These can be considered as the document "meta-data" or descriptive attributes. The keywords are also extended with their WordNet hypernyms (more general terms) and a classifier is used to assign each extracted keyword to one of the training set facets. This first stage identifies the facets from the well-specified training set that are relevant to the particular document collection and associates the document keywords (meta-data) with facets. The keywords of each facet are subsequently organized into a hierarchy by utilizing keyword co-occurrence patterns in order to identify subsumption/equivalency relations among the keywords.
The unsupervised technique presented in [46] builds upon the ideas presented in [47], although it is primarily applicable to documents rich in named entities (e.g., "Hillary Clinton", "Microsoft", etc.) such as news articles. A sophisticated array of algorithms is applied to identify named entities within documents. Then, Wikipedia data is used to identify the broader "category" associated with a named entity, e.g., "Hillary Clinton" is a "Person" and a "Politician". A few of these categories are automatically singled out and organized in independent hierarchies.
A few attempts have been made to extend and further improve the basic faceted search model. [21] argues that while facets are invaluable for refining keyword queries, existing solutions provide too little information to help users select an appropriate refinement. The only information provided, besides a small sample of possible refinements per facet, is the number of documents contained in the refined result set. For example, if a user queries an on-line store with keyword "digital camera", the results can potentially be refined by Manufacturer, such as "Nikon" or "Canon". The only guide provided for selecting "Nikon" or "Canon" is that 20 cameras are made by "Nikon" and 30 by "Canon". Instead, considerably more, and more useful, information can be presented by considering additional document meta-data, such as the average rating of Nikon and Canon products, their average price and so on. Besides introducing this and other improvements, [21] discusses their efficient implementation using only
regular, unmodified and freely available (e.g., Lucene) document retrieval systems (inverted indices).
Evidently, one of the limitations of the faceted search paradigm is its reliance on a handful of well-defined facet hierarchies. This tends to render the approach inapplicable to domains that exhibit high content variance, since no meaningful, universal facets exist. In Chapter 4 we present a technique similar in spirit to faceted search which instead utilizes more accessible meta-data, such as sentiment extracted from text.
2.3 Social Annotation
Social annotation, also referred to as collaborative tagging, has been steadily building momentum since its recent inception and has now reached the critical mass required for driving exciting new applications. For instance, in December 2008 del.icio.us (www.delicious.com), one of the many applications using collaborative tagging, reported 5.3 million users who had annotated a total of 180 million URLs. Given that the user base and content of such sites has been observed to double every few months, these numbers only loosely approximate the immense popularity and size of systems that employ social annotation.
Users of an on-line, collaborative tagging system add to their personal collection a number of documents (e.g., Web pages, scientific publications, etc.) and associate with each of them a short sequence of keywords, widely known as tags. Each tag sequence, referred to as an assignment, is a concise and accurate summary of the relevant document's content according to the user's opinion. The premise of annotating documents in that manner is the subsequent use of tags in order to facilitate the searching and navigation of one's personal collection.
As an example, del.icio.us users add to their collection the URLs of interesting Web pages and annotate them
with tags so that they can subsequently search for them easily. Users can discover and add URLs to their collection
by browsing the web, searching in del.icio.us or browsing the collections of other users. Given the considerable
overlap among the individual collections, documents accumulate as meta-data a large number of assignments,
each one of them posted by a different individual.
Research on collaborative tagging has mainly followed two directions. One direction focuses on utilizing this
newly-found wealth of information in the form of tag meta-data to enhance existing applications and develop new
ones. The second attempts to understand, analyze and model the various aspects of the social annotation process.
With respect to searching for and ranking extended documents with tag meta-data, Hotho et al. [76] propose a static, query-independent ranking of the documents (as well as of users and tags) based on an adaptation of the PageRank algorithm [31]. Users, documents and tags are first organized in a tripartite graph, whose hyper-edges are links of the form (user, document, tag). This graph is then collapsed into a normal undirected graph
whose nodes represent indiscriminately users, documents and tags, while edge weights count co-occurrences of
the entities in the hyper-edges of the original graph. The PageRank algorithm is then applied, producing a total
ordering involving all three types of entities.
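The collapse of the tripartite graph can be sketched as a co-occurrence count over hyper-edges; the assignments below are invented for illustration:

```python
from collections import Counter
from itertools import combinations

# (user, document, tag) hyper-edges
assignments = [
    ("u1", "d1", "python"),
    ("u1", "d2", "python"),
    ("u2", "d1", "python"),
    ("u2", "d1", "search"),
]

# Undirected edge weights: how often two entities co-occur in a hyper-edge.
weights = Counter()
for edge in assignments:
    for x, y in combinations(sorted(edge), 2):
        weights[(x, y)] += 1
```

Running PageRank over this weighted graph then yields the single ordering mixing users, documents and tags described above.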
Yanbe et al. [152] use tags to improve document ranking by combining into the ranking function many features
extracted from tags. Such features include the similarity of the query to tags, as well as the number of user
assignments and their recency. Intuitively, the last two features are indicative of the document’s quality and
freshness, suggesting an authority measure similar to PageRank, but less susceptible to spam (Section 2.1).
Bao et al. [17] adapt a machine learning approach to ranking in the case where the extended documents are
Web Pages. A support vector machine is used to “learn” the ranking function [85] which weighs five different
features of the pages: the tf/idf similarity of the page and the query [106], two different similarity measures
between the query and the tags, its PageRank, and the PageRank adaptation that was presented in [76].
Amer-Yahia et al. [7] propose a solution for ranking documents efficiently, under the constraint that only the
assignments posted by users in social network neighborhoods are to be used, thus personalizing query results
based on the social network of the user submitting the query. The proposed technique is a general framework that
can be used in combination with a variety of ranking functions monotonic in the frequency of tags appearing as
query terms.
In Chapter 5 we present our own approach to using tags in order to improve the ranking of extended documents. Unlike the research presented above, it utilizes tags in a principled probabilistic manner, motivated by our growing understanding of the social annotation process ([64, 70, 33, 73], presented below), while being able to cope with the scale and rapid growth of social annotation systems.
Besides ranking, researchers have also looked into other interesting problems related to collaborative tagging,
including facilitating the exploration of the socially annotated document collection. Li et al. [97] organize the tags
in a loose hierarchical structure in order to facilitate browsing and exploration of tags and documents. Ramage et
al. [125] extended the Latent Dirichlet Allocation technique [27] for extracting topics from textual data to include
tags and used it to derive an improved clustering of web pages into thematic categories.
Another body of work is concerned with the analysis and modeling of collaborative tagging systems [64, 70,
33, 73]. [64, 70] observed that the distribution of tags assigned to a document converges rapidly to a remarkably
stable heavy-tailed distribution. [64] concentrates on identifying the user behavior that leads to this phenomenon,
while [70] attempts to mathematically model it.
[33], on the other hand, explores and models the co-occurrence patterns of tags across documents. They found that, given a tag, the tags that tend to be used in conjunction with it follow a stable heavy-tailed distribution: the more specific the tag in question is, the flatter the tail of the distribution. The authors attribute this to a hierarchical organization of the tags co-occurring with the one singled out. This observation points to the utility
of tags for exploring relevant collections of extended documents.
Heyman et al. [73] investigate whether the additional information provided by social annotations has the potential to improve Web search, and reach mostly positive conclusions. They considered the del.icio.us social annotation system, where users tag Web pages. Among their important positive conclusions is that the Web pages present in del.icio.us are interesting, fresh and actively updated, and that judges in their user study found tags to be both relevant and objective. Their two most significant negative observations are that the pages present in del.icio.us cover a tiny fraction of the Web overall, and that tags also tend to be present in the Web page text.
2.4 Information and Entity Extraction
The use of Information Extraction (IE) technology can expose vast amounts of information embedded within
the textual component of documents and enable profoundly more sophisticated ways of interacting with such
collections of extended documents.
IE systems are (usually sophisticated) algorithms capable of identifying structured information in unstructured textual data. Recent tutorials [10, 5, 58] provide an excellent overview of the technology supporting information extraction, both rule-based and machine learning-based. Typically, an instantiation of a particular IE system is able to identify a single, well-specified relation. For example, an IE algorithm can be tuned and trained to scan news articles and retrieve concert information. Such information can be described by a tuple with schema 〈Band Name, City, Venue, Date〉. A specialized IE application is Named Entity Extraction, whose goal is to identify mentions of named entities such as people's names, corporation names, products, etc.
Entities extracted from text enable the Entity Search querying model [42]. The model is motivated by the
observation that many queries issued on a document collection, such as the Web, do not seek a specific document
for browsing, but rather information which can be present in multiple documents or scattered across documents.
Example queries are “amazon customer service phone” or “university of toronto professors”. The latter query
searches for entities (professors) mentioned in Web pages containing terms “university of toronto”. Hence, in the
Entity Search model, queries are comprised of both desired entity types and keywords. As a response, entities
matching the query and supporting Web pages are presented. In [42], entity ranking is based on a probabilistic framework, where documents matching the keyword portion of the query and containing a requested entity provide
“evidence” in its favor.
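The evidence-aggregation intuition can be sketched with a toy counting model; the documents and entities below are invented, and [42] uses a full probabilistic framework rather than raw counts:

```python
from collections import Counter

# Each document carries its text and the entities extracted from it.
docs = [
    {"text": "university of toronto faculty page", "entities": ["Alice"]},
    {"text": "university of toronto research group", "entities": ["Alice", "Bob"]},
    {"text": "cooking blog", "entities": ["Carol"]},
]

def entity_search(keywords, docs):
    """Score each entity by the number of keyword-matching documents
    that mention it; each matching document is one piece of "evidence"."""
    scores = Counter()
    for d in docs:
        if all(k in d["text"] for k in keywords):
            scores.update(d["entities"])
    return scores.most_common()

ranked = entity_search(["university", "toronto"], docs)
```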
Similarly, [8] utilizes both entities extracted from text and known associations between them to offer sophis-
ticated querying functionality. For instance a user can query for “pet-friendly hotels in a lively city”. The struc-
tured information supporting this query are tuples〈Hotel, City〉. But abstract attributes such as “pet-friendly”
and “lively” cannot be exposed using IE techniques and instead standard information retrieval techniques need to
be applied to retrieve documents related to these terms. All these diverse pieces of structured and unstructured information are efficiently stitched together to provide the top 〈Hotel, City〉 pairs matching the desired attributes, as
well as documents supporting this claim.
Besides querying, entities have also been used for mining textual data. [142, 139] utilize entities extracted from
documents in order to identify “topics” present in a document collection. Topics are detected by first computing
pairs of correlated entities and then further grouping these entity pairs into clusters. The assumption underlying
this approach is that such groups of entities correspond to an underlying event. In Chapter 3 we build upon
the principles presented in [142, 139] to provide interactive exploratory functionality for extended document
collections which, in addition to entities, are associated with categorical meta-data attributes.
Unlike other extended document meta-data, the generation of entities and their relationships is not a byproduct
of user activity, but an extremely demanding computational task. Each instantiation of an IE algorithm is designed
to extract a particular relation from text. Normally, numerous such IE “black-boxes” need to be applied in order
to extract useful and diverse information from documents. The development and debugging of such IE black-
boxes, the combination and reconciliation of their outputs, as well as their efficient application on large document
collections is a herculean task. Many systems are currently under development whose goal is to manage and
optimize this entire process.
In this spirit, [136] propose an approach for developing a complex IE program using the Datalog language to
combine small and highly specific IE “predicates”. The advantages of this approach are twofold. First, developing
highly specific and targeted IE algorithms allows for more effective and focused development. Second, stitching together these small and specific IE operators into a larger and more complex program using Datalog rules
enables their optimized execution: since the high level IE program requires the application of many smaller IE
operators and joining of their output into more complex relations, the order in which they are applied is crucial for
performance. Hence, [136] focuses on the enumeration of possible execution strategies and the use of statistics
and cost models to identify the most efficient one.
While the focus of [136] is on efficiency, the quality of the generated data is also of paramount importance.
IE algorithms offer limited precision and recall. Motivated by this observation, Ipeirotis et al. [82] develop an
optimization framework for the efficient extraction of a single relation from a text collection, at the desired recall
level (% of tuples recovered). They identify that there exist four possible execution strategies for applying an IE
algorithm on a document collection, with unique recall/execution cost characteristics. A sophisticated cost/recall
estimation process is developed and used in the context of an adaptive query execution framework: the system initiates the execution with what a priori appears to be the optimal plan and, as more accurate statistics about the corpus are gathered during execution, adaptively switches to a more efficient plan at the desired recall level.
2.5 Sentiment and User Feedback
In the context of active user participation in on-line activities, it is common for users to express, either explicitly
or implicitly, their views and opinions on products, events, etc. For example, on-line forums such as customer
feedback portals offer unique opportunities for individuals to engage with sellers or other customers and provide
their comments and experiences. These interactions are typically summarized by the assignment of a numerical
or "star" rating to a product or the quality of a service. Numerous such applications exist, like Amazon's customer feedback and Epinions. Any major online retailer engages, in one way or another, with consumer-generated feedback.
But even if ratings are not explicitly provided, sentiment analysis tools [145, 120, 119] can identify with a high
degree of confidence the governing sentiment (negative, neutral or positive) expressed in a piece of text, which in
turn can be translated into a numerical rating. This capability enables the extraction of ratings from less formal
reviews, typically encountered in blogs. Extending this observation, such tools can be employed to identify the
dominant sentiment not only towards products but also events and news stories. Virtually any document can be
“extended” by associating it with a rating signifying the author’s attitude towards some event.
Conversely, users not only voluntarily express their opinion, but also actively seek such information made
available by fellow users. Studying feedback provided by complete strangers on-line is an integral part of Internet users' decision-making process [119]. There is "safety in numbers" that neither a limited number of personal
acquaintances nor professional critics can provide. Hence, facilitating access to this information is a pressing
need.
Aspect summarization enables a better understanding of user reviews. Aggregating user ratings or sentiment to provide an overall rating towards a product, service or event is informative, but masks finer-granularity patterns. Most rated items have certain aspects that users like and other aspects that they dislike. A high-rated restaurant can have "great food", but "slow service". The goal of aspect summarization is to expose these facets together with the aggregate sentiment towards them. Evidently, aspect summarization is closely related to faceted search and automatic facet discovery (Section 2.2).
Aspect summarization techniques typically employ machine learning tools applied off-line to a collection of reviews. As an example, Yue et al. [103] extract rated aspect summaries from a collection of extended documents with ratings as their meta-data. The authors use a variation of Probabilistic Latent Semantic Analysis (PLSA) [75] to identify aspects as PLSA topics comprised of "modifier" phrases, such as "good service". Overall document ratings are then used to associate each aspect with its individual rating.
In Chapter 4 we present a tool that is similar in spirit to the rated aspect summarization technique of [103].
However, "rated aspects" are computed on-the-fly, rather than off-line, in response to an ad-hoc keyword query and
suggested as possible expansions of the original query.
Another valuable task is singling out high-quality reviews. [4] consider many document (review) features, such as text quality, and document meta-data, such as hyperlinks and user ratings (votes on the review's helpfulness), in order to develop a classifier capable of identifying high-quality documents. A related approach is adopted in [61]. There, besides linguistic features, document sentiment is used to predict its perceived usefulness. Intuitively, the subjectivity of the document affects its quality.
Evidently, user reviews posted on-line are relevant not only for fellow users seeking information, but also for manufacturers, analysts, etc. Reviews and their associated ratings or sentiment can both influence and predict sales. [61] develop a model to estimate the effect of a review on subsequent sales, taking into account its rating, its quality and other features. [155] build an autoregressive model which uses sales at times t−1, t−2, . . ., as well as review ratings, quality and sentiment at these time instances, to predict sales at time t. Similarly, [13] develop a linear regression model to predict movie box office results using micro-blogging messages from Twitter and their associated sentiment.
2.6 Dynamic Document Collections
Textual content is being generated on-line at a torrential pace (Example 1.5). There is a constant flow of new
documents being added to document repositories. Furthermore, this “stream” of information is non-stationary. Its
focus and characteristics shift over time, in a gradual or abrupt manner.
The mining task of Topic Detection and Tracking [153] seeks to extract time evolving topics from a dynamic
document collection. For instance, the topics mined from a blog post collection can correspond to active discussion about public events. The literature in this area is vast. Some, such as [147], adopt a probabilistic approach and identify a topic as an evolving word distribution, in the spirit of PLSA [75]. Others, such as [16],
perform at each time period (e.g., single day) a clustering of correlated words or entities. Topics are identified
as persistent word clusters, allowing for changes in cluster structure over time. Recent work utilizes, in addition
to the documents’ textual component and timestamps, information about the underlying social network in the
context of which content is generated [158].
The appearance of a new theme or topic in a dynamic document collection is rarely gradual. Typically, the
emergence of a new topic is signaled by a burst in activity. Kleinberg [91] detects escalating bursts for each word using a Hidden Markov Model whose hidden states correspond to levels of "burstiness" for that word. Bursts are detected by estimating the most likely hidden state sequence from observed word arrival rates. [60] takes this idea further: after periods of bursty activity for each word are identified, "bursty words" are grouped together into "bursty events".
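A drastically simplified two-state version of this idea can be sketched as a Viterbi pass over per-period word counts; the Poisson rates and switching cost below are arbitrary illustrative choices, not Kleinberg's parameterization:

```python
import math

def detect_bursts(counts, r0=2.0, r1=8.0, gamma=1.0):
    """counts: events observed per time period. Returns the most likely
    hidden-state sequence (0 = base rate r0, 1 = bursty rate r1); each
    state switch incurs cost gamma."""
    def cost(state, c):
        # negative log Poisson likelihood, dropping the state-independent c! term
        r = r1 if state else r0
        return r - c * math.log(r)

    best = [0.0, gamma]  # cost of starting in state 0 / state 1
    back = []
    for c in counts:
        step = []
        for s in (0, 1):
            stay, switch = best[s], best[1 - s] + gamma
            prev = s if stay <= switch else 1 - s
            step.append((min(stay, switch) + cost(s, c), prev))
        back.append((step[0][1], step[1][1]))
        best = [step[0][0], step[1][0]]

    # trace back the optimal state sequence
    s = 0 if best[0] <= best[1] else 1
    states = []
    for prevs in reversed(back):
        states.append(s)
        s = prevs[s]
    return states[::-1]

states = detect_bursts([1, 2, 1, 9, 10, 8, 2, 1])
```

On this toy count sequence the middle, high-count periods come out in the bursty state, while the low-count periods at either end remain in the base state.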
The techniques presented above are off-line and retrospective, i.e., they are applied to a static snapshot of
a dynamic document collection. A more relevant research challenge in modern real-time text streams, such as
micro-blogging services, is on-line detection of events, ideally as early in their genesis as possible.
For instance, [113] introduces a technique for detecting and tracking topics in an on-line fashion. The text stream is modeled using a mixture of drifting word distributions corresponding to topics. Statistical techniques are employed to dynamically select the "optimal" number of components in the mixture, i.e., the number of topics currently present in the most recent documents of the dynamic collection. The need to introduce a new component is indicative of the emergence of a new topic in the text stream.
A different on-line "event detection" problem, unrelated to topic mining, is studied in [110]. Content published on-line attracts varying degrees of attention from users. Depending on the context, the attention received is materialized as links to the document, positive votes, etc., that are spread over time. [110] performs on-line and early detection of documents that receive an unexpectedly high degree of attention, based on the attention-gathering intensity that has been normal for the document's source. The decision to pro-actively declare a document as "attention gathering" is made using sequential statistical tests.
The applications described above mine dynamic document collections for patterns, topics and events. However, the dynamic nature of modern document collections is also redefining search, since on many occasions document recency needs to be incorporated into ranking decisions. [52] discuss "breaking-news queries", for which document recency should be a defining feature of result relevance. Intuitively, breaking-news queries are queries about on-going events. The system presented in [52] detects breaking-news queries using a classifier and responds by using a ranking model tailored to such queries.
In Chapter 6 we introduce a novel technique for querying dynamic collections of extended documents associated with categorical attributes. The solution recognizes the extra information content carried by newly generated documents and responds by identifying, among the most recent documents, a subset that is "uniquely interesting" according to user preferences.
2.7 Integrated Document Collections
Information available on-line is extremely heterogeneous. The complications of this heterogeneity are typically resolved by developing techniques for querying, exploring and mining a coherent subset of the data at a time. For instance, there exist specialized vertical search engines for querying Web pages, news articles, blog posts, micro-blog posts, product reviews, etc. In addition to textual information, a wealth of structured information is also available on-line, residing in Web-accessible "deep web" databases [23]. Again, vertical search engines provide access to a subset of this information at a time: products (e.g., www.bing.com/shopping), flight information (e.g., www.kayak.com), etc.
Major search engines are responding to this fragmentation. They are evolving from glorified information retrieval algorithms operating on the corpus of Web pages into query-answering ecosystems acting as a single access point to all the information available on-line, textual or structured. Depending on the query, data collections alternative to the Web page index are used to serve information. Perhaps the simplest and most familiar example is a query about the weather, e.g., "toronto weather". The top result of such a query is actually the current weather in Toronto and a weather forecast, rather than a link to a Web page. This information is obviously retrieved from an ancillary data source, perhaps the database of a partner Web site. A large fraction of queries can greatly benefit from the integration of data from alternative sources into Web search results: queries about products (e.g., "4mp sony cameras"), reviews (e.g., "good toronto hotels"), flights (e.g., "toronto to chicago flights").
The central challenge in querying integrated document collections is determining the most relevant document
collections for each query. A widespread approach is to train classifiers and use them to route a Web query to the appropriate document or structured data collections. Recent work [11, 12] suggests using multiple features extracted from the query string (e.g., keywords such as "jobs", "reviews", etc.), query log data and click-through data (queries and clicks that reached a collection), and the document collections themselves (similarity of the query to a collection's text).
Of particular interest is the interaction of Web search with real-time, dynamic document collections such as news articles and micro-blog posts. Whether a query is "newsworthy" and would hence benefit from the inclusion of news articles and micro-blog posts generated in real time depends on the time at which it is issued. It would make sense to integrate fresh news in the results of a query for "toronto blue jays" around the time of a relevant sports event, but not otherwise. The approach adopted by Diaz et al. [51] searches for joint bursts of activity in queries and the news collection. It then speculatively presents relevant news articles in response to the "bursty" queries and determines whether a query is newsworthy or not based on how often the news articles are preferred over the regular search results.
Integrating structured data collections into Web search raises additional challenges. Evidently, a query classification approach [11, 12] could be used in order to determine whether a structured data source should be utilized to answer the query. However, such queries could greatly benefit from a deeper structural analysis, which can additionally identify which structured attributes are present in the query. For example, besides simply identifying that the query "4mp sony cameras" can be issued to a product database, we would like to identify in particular that
products of Type = Digital Camera and characteristics Resolution = 4 MP, Brand = Sony are requested. In Chapter 7 we further discuss the benefits of such an approach and present an effective and efficient solution for this task.
8 http://www.bing.com/shopping
9 www.kayak.com
CHAPTER 2. RELATED WORK 22
Besides querying, text mining applications can benefit from the use of multiple, integrated document collections. For instance, Topic Detection algorithms can be applied to news articles, blog posts or micro-blog post collections independently. Nevertheless, it would be preferable to identify the same topic in all three document collections. This would provide both richer context and improved topic detection quality. To this end, recent work
extends the PLSA approach [75] of detecting topics as word distributions, to topics spanning multiple static [157]
or dynamic collections [147, 146].
Another example that demonstrates the benefit of integrating document collections in text mining is offered
by [78]. In traditional document clustering [34] documents are viewed as bags of words. [78] suggests enhancing
this representation by utilizing the Wikipedia text corpus. In a sense, an extra transitive similarity link between
two documents is established through their similarity to the same Wikipedia documents, which implies a “topical”
connection.
Part I
Exploration
Chapter 3
Interactive Exploration of Extended
Document Collections
3.1 Introduction
In this Chapter we introduce techniques that enable the intuitive exploration of extended document collections.
This is accomplished by leveraging two distinct but complementary types of meta-data: mentions of interesting
and relevant entities extracted from the documents’ textual component and categorical document attributes.
Let us illustrate the proposed functionality by considering a collection of blog posts as a concrete example.
One can utilize an entity extractor and obtain entities of interest such as people, locations, products, companies,
etc., mentioned in the posts. Then, posts by bloggers discussing the “Dark Knight” movie also mention actor names such as “Heath Ledger” and “Christian Bale”. Posts discussing the Canadian “Listeriosis” outbreak mention the disease in conjunction with “Public Health Agency of Canada” and locations like “Canada” and “Toronto”. The key observation is that related posts, capturing the same story, mention approximately the same group of core entities. By identifying such groups of strongly associated entities, i.e., groups of entities that are
recurrently mentioned together, we implicitly detect the underlying event of which they are the main actors.
Besides identifying strong entity associations (and, hence, the underlying events) in the entire post collec-
tion, we would like to utilize the fact that bloggers typically reveal their demographic profile, such as their age,
gender, occupation and location. This information allows us to expose the stories capturing the attention of each
demographic segment, such as “people in the US” or “young males in the US”.
Then, relevant entity associations can be presented on demand as we “drill-down” to the demographic of in-
terest, or compared for different demographics. Once an entity-group of interest has been identified, the most
CHAPTER 3. INTERACTIVE EXPLORATION OF EXTENDED DOCUMENT COLLECTIONS 25
relevant or influential posts associated with it can be easily fetched and browsed. In this manner, a deeper un-
derstanding of the underlying event can be achieved. For a demonstration of this functionality and its utility see
[9].
To support such interactive browsing and exploration of the document collection, we need to pre-compute and materialize strongly associated entities for all attribute value combinations that can possibly be requested. We refer to the fraction of documents matching a certain attribute value restriction as a slice of the collection (e.g., a slice can be a restriction on demographic attributes in the case of blogs). Depending on the application, two variations of our core problem are of interest: computing all sufficiently strong entity associations (Threshold Variation) and computing the top-k strongest entity associations (Top-k Variation), for all the different slices of the extended document collection. The Top-k Variation is particularly interesting and highly useful. Given that associations are meant to be browsed, identifying the k most pronounced ones eliminates the need for a detection threshold that could lead to the computation of too few or too many associations.
Additionally, note that we used the notion of association among entities in a very generic sense. Depending on the application context, robust measures of statistical correlation or even simpler measures of set-overlap might be appropriate for quantifying the degree of association between entities. Given the wide applicability of the proposed functionality, all plausible measures should be supported. Such flexibility enables the use of complex measures such as the Likelihood Ratio statistical test [148], whose unique properties and behavior [54] render it ideal for exposing interesting and meaningful entity associations in user-generated content.
The relentless pace at which user-generated content is accumulated, combined with the need to analyze up-to-date data, necessitates highly efficient solutions for both the Threshold and the Top-k variations of the problem. In
order to address this challenge, we make the following contributions.
• We develop algorithm THR-ENT for addressing the Threshold variation of the problem. THR-ENT elimi-
nates from consideration provably weak associations.
• We develop algorithm TOP-ENT for addressing the Top-k variation of the problem. TOP-ENT supports
early termination as soon as it can guarantee that the top-k associations computed so far constitute the final
result.
• THR-ENT and TOP-ENT are designed with the explicit goal of supporting virtually any association measure,
no matter how complex.
• We identify and exploit computation sharing and optimization opportunities, as well as accuracy/efficiency
trade-offs offered by the overlap among slices.
• We demonstrate the efficiency and applicability of the proposed techniques using both synthetically gener-
ated and real world data, comprised of 1.4M blog posts pre-processed by a custom entity extractor.
Part of the work presented in this Chapter also appears in [130]. The rest of the Chapter is organized as follows: In Section 3.2 we discuss existing work. In Section 3.3 we formally define the problems we need to address. Section 3.4 presents our core algorithmic techniques and Section 3.5 presents the infrastructure required for their application. Further optimization opportunities and trade-offs are explored in Section 3.6. Section 3.7 discusses useful extensions of our techniques. In Section 3.8 we evaluate the performance of the proposed solutions and highlight the usefulness of the Likelihood Ratio test. We conclude in Section 3.9.
3.2 Comparison to Existing Work
The core problem of identifying strong entity associations is related to association rule mining and set-similarity
search. In association rule mining [6, 30], a collection of item-sets (extended documents) is mined in order to,
essentially, identify frequently co-occurring items (entities). In set-similarity search [44, 19, 150], a collection of
sets (entities) is probed by a query-set, and collection-sets with sufficiently high overlap are retrieved. However,
existing techniques from these two domains cannot satisfy the significantly richer requirements of our application.
First, the proposed techniques support virtually any measure of association between entities. Existing algorithms are tailored around a single measure, e.g., support/confidence [6], the X² test [30] or simple set-similarity measures [19, 150]. The flexibility offered by the solutions introduced enables the use of complex, non-linear but robust association measures like the Likelihood Ratio test. As demonstrated before [54] and advocated in Section 3.8, the unique and intuitive behavior of the Likelihood Ratio renders it ideal for use in our setting. To the best of our knowledge, no previous algorithm is general enough to support the Likelihood Ratio or any other plausible measure of association.
Second, the proposed techniques support the efficient computation of the k strongest entity associations. Such functionality is extremely powerful as it eliminates the need to set a sensitive detection threshold that could lead to the detection of too few or too many associations.
Third, we explore in depth and exploit computation sharing opportunities arising from the need to compute
entity associations for all the slices of the extended document collection. These traits further distinguish the
solutions introduced from previous work on association rule mining and set-similarity search.
In the context of Topic Detection and Tracking, [142, 139] utilize entities extracted from the documents in order to identify topics. Topics, for a single day, are detected by first computing pairs of entities found to be associated using the X² test and then further grouping these entity pairs into entity clusters. Our approach extends the ideas of [142, 139]. In addition to entities, we leverage categorical document attributes to identify
entity associations in all the slices of the extended document collection and, thus, enable deeper understanding of
the content and enhanced exploratory functionality. Furthermore, we focus on efficiency and applicability of the
proposed techniques to massive document collections, an issue left unaddressed in [142, 139].
3.3 Formal Problem Statement
Consider a collection D of n extended documents d1, . . . , dn whose meta-data are a set of entities and categorical attributes. Each extended document is associated with l attributes denoted with A1, . . . , Al. We denote with di(A1, . . . , Al) the meta-data attribute values annotating di. To ease the exposition of our ideas (and without sacrificing generality) we assume that the attribute domains Dom(A1), . . . , Dom(Al) are unordered. In general, the domains can be partially ordered, e.g., they can form a hierarchy.
Let A ∈ Powerset(A1, . . . , Al) be a subset of the l attributes and A× be the cartesian product of their corresponding domains. For example, if A = {A1, A2}, then A× = Dom(A1) × Dom(A2). Set A× is essentially comprised of all the value combinations of the attributes contained in A. Each element a of set A× defines a slice sD(a) of the extended document collection: the slice is comprised of the documents whose attribute values match those in a, i.e., sD(a) = {d ∈ D | a ⊆ d(A1, . . . , Al)}.
Example 3.1. Our extended document collection is comprised of blog posts associated with two attributes specifying the blogger’s demographic profile: (A)ge and (G)ender, with (A)ge = {young, old} and (G)ender = {male, female}. The two meta-data attributes and their domains define nine distinct slices of the collection: (young, male), (young, female), (old, male), (old, female), (young), (old), (male), (female) and (), where the last slice is essentially the entire post collection.
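As an illustration of the slice machinery, the following Python sketch enumerates all slices induced by a set of unordered attribute domains and materializes a slice sD(a); for the domains of Example 3.1 it produces exactly the nine slices listed. The helper names are illustrative, not part of the system described in this Chapter.

```python
from itertools import combinations, product

def all_slices(domains):
    # domains: attribute -> list of values, e.g. {"Age": ["young", "old"]}.
    # A slice is a dict of attribute-value restrictions; {} denotes the
    # unrestricted slice (), i.e., the entire collection.
    attrs = list(domains)
    slices = []
    for r in range(len(attrs) + 1):
        for subset in combinations(attrs, r):
            for values in product(*(domains[a] for a in subset)):
                slices.append(dict(zip(subset, values)))
    return slices

def slice_of(docs, restriction):
    # sD(a): documents whose attribute values match the restriction a.
    return [d for d in docs if all(d.get(a) == v for a, v in restriction.items())]
```

For two attributes with two values each, the sketch yields 1 + 4 + 4 = 9 slices, matching the enumeration of Example 3.1.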
The application of an entity extraction algorithm on D reveals the mentions of m distinct entities e1, . . . , em in the documents1. The set of entities mentioned in a document are part of its meta-data. It will be convenient to represent the available information with respect to the occurrences of entities in documents by means of an entity-document occurrence matrix Om×n.
Definition 3.1. The entity-document occurrence matrix Om×n is a binary matrix such that element oij = 1 only if entity ei is mentioned in extended document dj.
The document-list ei = {dj | dj mentions ei} of entity ei corresponds to the i-th row of matrix O. Respectively, the entity-list dj = {ei | ei mentioned in dj} of document dj corresponds to the j-th column of the matrix. The document-lists and entity-lists are essentially row-oriented and column-oriented representations of the sparse matrix O.
1 Certain classes of entity extraction algorithms attach a confidence value to each entity. For those algorithms, we assume that an appropriate threshold has been selected and only the entity matches with confidence exceeding the threshold are reported.
We denote with ci = |ei| the number of documents that mention entity ei and with cij = |ei ∩ ej| the number of documents where entities ei and ej are mentioned together. Lastly, we denote with esi the documents where entity ei occurs in the context of a specific slice s, i.e., esi = {dj ∈ s | dj mentions ei} = ei ∩ s.
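As an illustration, the document-lists and the counts ci and cij can be materialized with simple inverted lists; the sketch below uses hypothetical names and represents each list as a set of document identifiers.

```python
from collections import defaultdict

def build_document_lists(docs_entities):
    # docs_entities: dict doc_id -> set of entities mentioned in the document
    # (the entity-lists, i.e., the column-oriented view of O).
    # Returns the row-oriented view: entity -> set of doc ids (document-lists).
    doc_list = defaultdict(set)
    for dj, entities in docs_entities.items():
        for ei in entities:
            doc_list[ei].add(dj)
    return doc_list

def cooccurrence(doc_list, ei, ej):
    # c_ij = |e_i ∩ e_j|: number of documents mentioning both entities.
    return len(doc_list[ei] & doc_list[ej])
```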
At the core of the proposed solution lies the need to materialize strongly associated groups of entities for all the
slices of the document collection. In what follows, we concentrate on the computation of associated entity pairs (i.e., groups of two entities). The definitions and techniques introduced are extended for associations involving an
arbitrary number of entities in Section 3.7.
There exists a wide range of alternatives that can be applied to assess the degree of association between entities. Some of the most commonly used measures of association can be broadly classified into two categories: statistical correlation measures and set-similarity measures2.
Statistical measures of correlation treat entities as random variables and assess their association using a statistical hypothesis test [148]. The base hypothesis typically used is that two entities ei and ej occur in the relevant slice of the collection independently of one another, i.e., if pi (pj) is the probability that a document contains ei (ej) and pij is the probability that a document contains both entities, then pij = pi·pj. For significantly associated entities, co-occurring frequently in documents, we have that pij ≫ pi·pj. Statistical tests produce numerical values which can be mapped to the likelihood that, for the entities examined, the assumption of independent occurrence in documents does not hold. Higher values indicate a higher likelihood that the assumption is violated and therefore signify stronger association. Two of the most common statistical hypothesis tests are the X² test and the Likelihood Ratio test [148].
Consider two entities ei and ej and a collection comprised of n extended documents. We denote with N11 the number of documents actually containing both entities. In a similar manner, let N10 (N01) be the number of documents containing entity ei (ej) but not entity ej (ei) and N00 the number of documents containing neither entity. We also denote with E11, E10, E01, E00 the expected values of these quantities under the independence assumption for entities ei and ej. The values of both the observed and expected quantities can be easily expressed as a function of the size n of the underlying document collection, the occurrences ci, cj and co-occurrence cij of the two entities.
• X²: X(ei, ej) = Σ_{x∈{0,1}} Σ_{y∈{0,1}} (Nxy − Exy)² / Exy
• Likelihood Ratio: L(ei, ej) = Σ_{x∈{0,1}} Σ_{y∈{0,1}} 2 Nxy ln(Nxy / Exy)
2 Probability metrics are a third popular category [62]. Such measures are also supported by the techniques subsequently introduced.
Set-similarity measures of association treat entities as the sets of documents where they appear, and attempt
to quantify entity association as set overlap. Perhaps the most widely used set-similarity measure is the Jaccard
Coefficient, although other measures, like the Dice Coefficient, are also used [105].
• Jaccard Coefficient: J(ei, ej) = |ei ∩ ej| / |ei ∪ ej| = cij / (ci + cj − cij)
• Dice Coefficient: D(ei, ej) = 2 |ei ∩ ej| / (|ei| + |ej|) = 2 cij / (ci + cj)
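All of the above measures are functions of n, ci, cj and cij alone. The following Python sketch computes the four measures from these counts; the helper `contingency`, which derives the observed counts Nxy and expected counts Exy, is ours for illustration, and terms with Nxy = 0 are taken to contribute 0 to the Likelihood Ratio.

```python
import math

def contingency(n, ci, cj, cij):
    # Observed counts N_xy and expected counts E_xy under independence,
    # all derived from n, ci, cj and cij as described above.
    obs = {(1, 1): cij, (1, 0): ci - cij, (0, 1): cj - cij,
           (0, 0): n - ci - cj + cij}
    pi, pj = ci / n, cj / n
    exp = {(1, 1): n * pi * pj, (1, 0): n * pi * (1 - pj),
           (0, 1): n * (1 - pi) * pj, (0, 0): n * (1 - pi) * (1 - pj)}
    return obs, exp

def chi_square(n, ci, cj, cij):
    obs, exp = contingency(n, ci, cj, cij)
    return sum((obs[k] - exp[k]) ** 2 / exp[k] for k in obs)

def likelihood_ratio(n, ci, cj, cij):
    obs, exp = contingency(n, ci, cj, cij)
    # Terms with N_xy = 0 contribute 0 (limit of x ln x as x -> 0).
    return sum(2 * obs[k] * math.log(obs[k] / exp[k]) for k in obs if obs[k] > 0)

def jaccard(ci, cj, cij):
    return cij / (ci + cj - cij)

def dice(ci, cj, cij):
    return 2 * cij / (ci + cj)
```

When cij equals its expected value ci·cj/n, both statistical tests evaluate to 0, consistent with the independence assumption.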
By carefully inspecting the aforementioned, as well as other, measures for assessing the degree of association
of an entity pair, we observe a number of shared mathematical properties. These properties capture a series
of intuitive characteristics that one would expect from a measure of association and strength of co-occurrence
between two entities.
Definition 3.2. An association measure M is a real function M(ei, ej) = M(ci, cj, cij) of three variables ci = |ei| (occurrences of entity ei), cj = |ej| (occurrences of entity ej) and cij = |ei ∩ ej| (co-occurrence of entities ei and ej). The function has the following properties.
We formalize this intuition with the notion of surprise [137, 30, 53]. Let p(wi) be the probability of word wi appearing in a document of the collection and p(w1, . . . , wr) be the probability of words w1, . . . , wr co-occurring in a document1. If words w1, . . . , wr were unrelated and were used in documents independently of one another, we would expect that p(w1, . . . , wr) = p(w1) · · · p(wr). Therefore, we use a simple measure to quantify
by how much the observed word co-occurrences deviate from the independence assumption. For a word-set F = {w1, . . . , wr}, we define

Surprise(F) = p(w1, . . . , wr) / (p(w1) · · · p(wr))
We argue that when considering a number of possible query expansions Fr(w1, . . . , wl), word-sets with high surprise values constitute ideal suggestions: we identify coherent clusters of documents within the original result set that are connected by a common underlying theme, as defined by the co-occurring words.
The use of surprise (unexpectedness) as a measure of interestingness has also been vindicated in the data
mining literature [137, 30, 53]. Additionally, the definition of surprise that we consider is simple yet intuitive and
has been successfully employed [30, 53].
Example 4.3. Consider a collection comprised of 250k documents and query “table, tennis”. Suppose that there exist 5k documents containing “table”, 2k documents containing “tennis” and 1k documents containing both words “table, tennis”. We easily compute that Surprise(table, tennis) = 25.
Let us compare the surprise value of two possible expansions: with term “car” (10k occurrences) and term
“paddle” (1k occurrences). Suppose (reasonably) that “car” is not particularly related to “table, tennis” and
therefore co-occurs independently with these words. Then, there exist 40 documents in the collection that contain all three words “table, tennis, car” (Figure 4.1). We compute that Surprise(table, tennis, car) = 25. While this expansion has a surprise value greater than 1, this is due to the correlation between “table” and “tennis”.
Now, consider the expansion with “paddle” and assume that 500 of the 1000 documents containing “table,
tennis” also contain “paddle” (“table, tennis, paddle”). We compute that Surprise(table, tennis, paddle) = 3125. As this example illustrates, enhancing queries with highly relevant terms results in expansions with considerably higher surprise values than enhancing them with irrelevant ones.
The maximum-likelihood estimates of the probabilities required to compute the surprise value of a word-set are derived from the textual data of the extended document collection D under consideration. We use c(F) = c(w1, . . . , wr) to denote the number of documents in a collection D that contain all r words of F. In the same spirit, we denote by c(wi) the number of documents that contain word wi and c(•) the total number of documents in the collection. Then, we can estimate p(w1, . . . , wr) = c(w1, . . . , wr)/c(•) and p(wi) = c(wi)/c(•). Using
1 If considered appropriate, more restrictive notions of co-occurrence can also be used, e.g., the words appearing within the same paragraph in a document.
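Under these estimates, surprise is computable from the raw document counts alone. The short Python sketch below reproduces the numbers of Example 4.3; the function and parameter names are illustrative.

```python
def surprise(c_all, word_counts, n_docs):
    # c_all: number of documents containing all words of the word-set F, c(F).
    # word_counts: individual document counts c(w_i) for each word in F.
    # n_docs: total number of documents in the collection, c(.).
    p_joint = c_all / n_docs
    p_indep = 1.0
    for c in word_counts:
        p_indep *= c / n_docs
    return p_joint / p_indep
```

With the counts of Example 4.3, surprise(1000, [5000, 2000], 250000) evaluates to 25 and the “paddle” expansion to 3125.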
This intuitive observation has also been validated by previous work on the dynamics of collaborative tagging.
[64, 70] demonstrated that the distribution of tags for a specific document converges rapidly to a remarkably stable,
heavy tailed distribution that is lightly affected by additional assignments. The heavy tailed distribution ascertains
the dominance of a handful of influential trends in describing a document’s content. The rapid convergence and
the stability of the distribution points to its predictability: namely, after witnessing a small number of assignments,
we should be able to predict with a high degree of confidence subsequent tag assignments.
Given the fast crystallization of users’ opinion about the content of a document, we can make a natural as-
sumption that will serve as a bridge between our ability to predict the future tagging activity for a document and
our need to compute p(q|d is relevant).
Users will use keyword sequences derived from the same distribution to both tag and search for a document.1
This logical link allows us to equate the probability p(q|d is relevant) to the probability of an assignment containing the same keywords as q being used to tag the document, i.e.,

p(q|d is relevant) = p(q is used to tag d)
The stability of the tag distribution allows us to accurately estimate the probability of a tag being used in the
future, based on the document’s tagging history. However, assignments are rarely comprised of a single tag. In our study (Section 5.6) we observed that the average length of an assignment is 2.77 tags. It is reasonable to
expect that neither the order in which tags are placed in an assignment, nor the co-occurrence patterns of tags in
assignments are random.
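Before modeling order and co-occurrence, the raw estimate itself can be sketched: treating each assignment as an ordered tag sequence, a simple maximum-likelihood estimate of p(q is used to tag d) is the fraction of past assignments that match the query keyword sequence. The Python sketch below is illustrative only; the tuple representation of assignments is an assumption for exposition.

```python
from collections import Counter

def tag_probability(history, q):
    # history: past assignments for document d, each a tuple of tags in the
    # order they were placed (hypothetical representation).
    # MLE estimate of p(q is used to tag d): fraction of assignments that
    # match the query keyword sequence exactly.
    if not history:
        return 0.0
    counts = Counter(history)
    return counts[tuple(q)] / len(history)
```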
In fact, [64] observed that tags are not used in random positions within an assignment, but rather progress
(from left to right) from more general to more specific and idiosyncratic. Therefore, assignments are not orderless
sets of tags, but sequences of tags, whose ordering tends to be consistent across the assignments attached to a
document, and consequently the queries used to search for it.
Additionally, tags representing different perspectives about a document’s content, although popular in their
own right, are less likely to co-occur in the same assignment.
Example 5.2. In our del.icio.us crawl, the Mozilla project main page is heavily annotated with tags “opensource”, “mozilla” and “firefox”. We observed that tags “opensource” and “firefox” appear together much less frequently than expected given their popularity, demonstrating two different perspectives for viewing the web site:
as the home of the Firefox browser or as an open source project. Such statistical deviations, more or less severe,
were observed throughout the del.icio.us collection.
1 Recent research [32] indicates that this assumption holds to a satisfactory degree. The two distributions are similar enough for our assumption to be plausible, but different enough for tags to provide incremental information not present in query logs.
tuples. Given the high dimensionality of the SDC grid, small changes in granularity can have huge impact on performance, as the number and size of cells is affected in an exponential manner. Nevertheless, for each individual experiment we used for SDC the granularity values that resulted in the best performance. Instead, for the STARS technique we kept the grid granularity at 10 buckets per dimension. Later in the section, we vary the grid granularity for STARS and observe performance trends.
[Figure 6.11: Effect of buffer size and number of attributes on performance. Panels (a)-(c): time per update (ms) vs. buffer size (10K-1000K) for 2d, 3d and 4d data; STARS vs. SDC.]
As is evident in Figures 6.11(a)-(c), STARS outperforms SDC by an order of magnitude. In Figure 6.11(c) we completely omitted the SDC technique, since its performance for documents with four attributes deteriorated.
CHAPTER 6. SKYLINE MAINTENANCE FOR DYNAMIC DOCUMENT COLLECTIONS 143
Furthermore, the time required by STARS in order to handle a buffer update is in the order of a millisecond or
less, thus rendering its use in real life applications entirely realistic. We will provide further evidence that supports this claim when we subsequently present our real data experiments.
Note that the performance trends in Figures 6.11(a)-(c) can be non-monotone with respect to the size of the buffer. This is to be expected, as there exist two competing trends that depend on the buffer size and affect performance. For example, as the buffer size increases, we can expect both the size of the skybuffer and the size of the skyline to increase. On the other hand, as the buffer size increases, the probability that the expiration of a document will affect the skyline decreases and so does the probability that an expensive skyline mending operation will have to be triggered. Remember that when a document belonging to the skyline expires, we need to identify all the documents that it exclusively dominated and insert them in the skyline. The converse is true when the buffer size decreases: the skyline size decreases while the invocations of reconstruction operations increase.
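The mending operation just described can be sketched in brute-force form. The Python sketch below is illustrative only: `leq` stands in for the per-attribute partial orders, and the STARS structures exist precisely to avoid this linear scan over the skybuffer.

```python
def dominates(a, b, leq):
    # a dominates b: a is preferred-or-equal on every attribute and strictly
    # preferred on at least one. leq(x, y) encodes the (partial) preference
    # order on a single attribute domain.
    at_least = all(leq(x, y) for x, y in zip(a, b))
    strictly = any(leq(x, y) and not leq(y, x) for x, y in zip(a, b))
    return at_least and strictly

def mend_skyline(skyline, skybuffer, expired, leq):
    # When a skyline document expires, promote every skybuffer document
    # that is no longer dominated by any remaining (live) document.
    skyline = [d for d in skyline if d != expired]
    live = skyline + [d for d in skybuffer if d != expired]
    for d in skybuffer:
        if d != expired and not any(dominates(o, d, leq) for o in live if o != d):
            skyline.append(d)
    return skyline
```

In the sketch, only documents exclusively dominated by the expired document (and not by any other live document) are promoted, matching the semantics described above.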
We performed additional experiments involving tree and wall domains with a wide range of parameter values (l, h, c). In these experimental results we observed similar trends and performance differences between the two techniques. The shape and size of the attribute domains influence performance mostly indirectly, by affecting the average size of the skyline: domains that produce larger skylines are associated with higher buffer update cost. Keeping two of the poset parameters (l, h, c) fixed, decreasing c (internal poset connectivity) increases the skyline size, and so does increasing l (number of poset values) and decreasing h (the number of depth levels).
[Figure 6.12: Performance on skewed real data. Time per update (ms) vs. buffer size (10K-200K) for STARS on the DMV data set.]
Besides synthetic data, we also employed a stream of real, skewed and correlated data for our performance
experiments. We used the categorical tuples of the DMV data set [80] as “documents”. They were associated with
three categorical attributes of the “cars” table: Maker/Model (38 possible values), Color (504 possible values)
and Year (74 possible values). The resulting documents have attribute values that are skewed and correlated. The degree of skew and correlation in this real data set is fixed. Also, since the categorical attributes are not associated with a partial-order, we manually organized their values in tree-structured posets (Figure 6.10(a)).
Figure 6.12 depicts the results of the experiment. For buffer sizes between 10K and 200K, the time required
by STARS to process a buffer update is in the order of a millisecond - an entirely realistic figure. SDC’s results are
omitted from the figure as it fared poorly: the update time ranged from 6ms for a 10K buffer to more than 100ms
for a large, 200K buffer. The figure demonstrates a clear upward trend in the update time for larger buffer sizes.
This can be attributed to the presence of skew and correlation in the attribute values that leads to considerably
larger skylines as the buffer size increases.
Our next experiment utilizes synthetic data in order to offer insight on the performance differential between
SDC and STARS. Figure 6.13 depicts the size of the skybuffer maintained by the techniques, for documents with two (Figure 6.13(a)) and three (Figure 6.13(b)) attributes. The domains were trees with parameters (500, 8, 0.3).
[Figure 6.13: Size of the skybuffer as a function of the buffer size. Panels (a), (b): skybuffer size (log scale, 0.1K-1000K) vs. buffer size for 2d and 3d data; STARS vs. SDC.]
SDC employs a mapping of documents with m categorical attributes to 2m-dimensional numerical tuples. However, the mapping is not exact in the sense that dominance in the categorical space does not imply dominance in the numerical space (Figure 6.9). The implication of this relation is that when a document is inserted in the skybuffer, a rectangular range search fails to identify all the skybuffer documents that are dominated by the inserted document. Therefore, the skybuffer of SDC can contain many more documents than the skybuffer of STARS, since documents that could have been removed still reside in the skybuffer. This has a big impact when SDC attempts to repair the skyline after a document expiration. Then, the additional documents in the skybuffer that need to be examined become a huge burden.
Notice that both axes are in logarithmic scale. In the case of documents with two attributes, the skybuffer size for SDC is greater than STARS’s, yet it is still manageable. However, for documents with three attributes, SDC’s
skybuffer size explodes: as the number of attributes increases, so does the number of categorical dominance relations that the mapping to the numerical domain fails to capture.
Further evaluation of STARS
We also designed and performed experiments to further study the STARS technique and its two components.
In particular, we performed experiments to quantify the pruning efficiency of the arrangement-based skyline
organization, as well as the gains that can be achieved by utilizing the proposed granularity setting mechanism in
conjunction with the poset partitioning technique.
[Figure 6.14: Pruning efficiency of arrangement skyline organization. Panels (a), (b): pruning efficiency vs. data dimensionality (2d, 3d, 4d) for tree and wall posets.]
Figures 6.14(a) and 6.14(b) present results demonstrating the pruning efficiency of the arrangement-based
skyline organization. A dominance query against the skyline determines whether a query document is dominated
by the skyline. The objective is to do so by checking as few skyline documents as possible. Therefore, pruning
efficiency is measured as the fraction of skyline documents that need to be examined on average in order to answer
a dominance query.
As is evident in Figures 6.14(a) and 6.14(b), the arrangement organization is able to answer a dominance query by considering on average about 10% of the skyline documents. This is true for both tree and wall structured domains. For this experiment, we materialized the tree structured domains with parameters (500, 8, 0.3) and the wall domains with parameters (250, 10, 0.3). Notice that the pruning efficiency decreases as the dimensionality increases, although not considerably. This is reasonable to expect, since the pruning technique utilizes
information from only two of the document’s attributes, therefore failing to exploit some pruning opportunities as
the dimensionality increases.
Our next experiment demonstrates the potential performance benefits of utilizing the techniques of Section 6.4.3, which allow us to set the skybuffer grid granularity. This involves partitioning the attribute domains into disjoint
value groups so that the desired grid granularity is matchedand the expected query time is minimized. For this
experiment, we measured the average time in milliseconds required to perform a skybuffer update, i.e., identify
all the documents in the skybuffer dominated by an incoming document, remove them and insert the incoming
document in the skybuffer.
[Figure 6.15: Effect of grid granularity on performance. Panels (a), (b): time per update (ms) vs. grid granularity (buckets per dimension) for tree and wall posets.]
Figure 6.15 depicts the time required to perform a skybuffer update versus the grid granularity, for documents with three attributes. The grid granularity is measured as the number of allocated buckets per dimension. More specifically, Figure 6.15(a) presents the results when the domains are tree structured posets with parameters (200, 4, 0.1), while Figure 6.15(b) presents the results when the domains are wall structured posets with parameters (200, 4, 0.1).
The leftmost values in the plots correspond to a partitioning that allocates a single bucket per depth level. By increasing the granularity we can achieve better performance. This performance increase would not be possible without our poset partitioning heuristic: a bad partitioning strategy would result in performance inferior to the bucket-per-depth-level partitioning scheme. However, the poset partitioning technique allows us to translate an increase in the grid granularity into an increase in performance. The benefit of increasing the grid granularity is
eventually offset by the overhead of visiting many sparse cells. This additional overhead explains the knee in the
curves of Figure 6.15. Notice that even though performance can be poor for extreme granularity values, there is a
wide range of values that offer near optimal performance.
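The bucket-per-depth-level baseline that the plots start from can be sketched as follows. This is our own illustration: the poset encoding and the greedy split used to reach a target granularity are simplifications, and the actual heuristic of Section 6.4.3 also takes the expected query time into account when forming value groups.

```python
from collections import defaultdict

def depth_partition(children, roots, target_buckets=None):
    """Group poset nodes into buckets by depth level (bucket-per-depth-level).

    children: dict mapping each node to a list of its children.
    roots:    nodes with no parent (depth 0).
    target_buckets: if given, greedily split the largest depth groups
    until that many buckets exist (a stand-in for the finer-grained
    partitioning heuristic described in Section 6.4.3).
    """
    depth = {}
    frontier, d = list(roots), 0
    while frontier:
        nxt = []
        for n in frontier:
            depth[n] = d                      # deepest level wins on re-visit
            nxt.extend(children.get(n, []))
        frontier, d = nxt, d + 1

    groups = defaultdict(list)
    for node, lvl in depth.items():
        groups[lvl].append(node)
    buckets = [sorted(g) for _, g in sorted(groups.items())]

    while target_buckets and len(buckets) < target_buckets:
        big = max(range(len(buckets)), key=lambda i: len(buckets[i]))
        g = buckets.pop(big)
        if len(g) < 2:
            break
        mid = len(g) // 2
        buckets[big:big] = [g[:mid], g[mid:]]
    return buckets
```

For example, a tree poset a → {b, c}, b → {d} yields three depth buckets, and requesting four buckets splits the largest level in two.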
Summary
To summarize, our experimental evaluation demonstrated the applicability of the proposed solution to a wide
range of buffer sizes and dimensionality (number of attributes), for both synthetic (Figures 6.11(a)-(c)) and real
(Figure 6.12) data. We also verified our claim that the skybuffer indexing technique can adapt to posets of any
shape and size by offering flexibility in controlling the granularity of the grid-based indexing structure (Figure
6.15). The second claim that we verified was the pruning efficiency of the skyline organization, which was also
found to be resilient to increases in dimensionality (Figure 6.14). Lastly, we demonstrated the inapplicability of
the existing offline skyline evaluation techniques of [35] in a streaming environment (Figures 6.11(a)-(c)) and
identified the inherent reasons behind this poor performance (Figure 6.13).
6.6 Conclusions
In this Chapter, we identified and motivated the problem of maintaining the skyline of a dynamic collection of
extended documents associated with partially ordered categorical meta-data attributes, and realized two novel
techniques that constitute the building blocks of an efficient solution to the problem.
We introduced a lightweight data structure for indexing the documents in the streaming buffer, that can gracefully adapt to documents with many attributes and partially ordered domains of any size and complexity. We subsequently studied the dominance relation in the dual space and utilized geometric arrangements in order to index the categorical skyline and efficiently evaluate dominance queries. Lastly, we performed a thorough experimental study to evaluate the efficiency of the proposed techniques.
Chapter 7

Integrating Structured Data into Web Search
7.1 Introduction
In Section 2.7 we reviewed how search engines are evolving from textual information retrieval systems to highly
sophisticated answering ecosystems utilizing information from multiple diverse sources. One such valuable
source of information is structured data, abstracted as relational tables, and readily available in publicly accessible
data repositories or proprietary databases.
Structured data can be used to better serve a large body of user queries that target information which does not
reside on a single or any Web page. Queries about products (e.g., “50 inch LG lcd tvs”, “orange fendi handbag”,
“white tiger book”), movie showtime listings (e.g., “indiana jones 4 near boston”), airline schedules (e.g., “flights
from boston to new york”), are only a few examples of queries that are better served by directly using information
from structured data. This information is presented prominently along with regular search results (Figure 7.1).
However, integrating structured data collections with the traditional Web page index in this manner poses the following important challenges:
Web speed: Web users have become accustomed to lightning fast responses. Studies have shown that even sub-
second delays in returning search results cause dissatisfaction among Web users, resulting in query abandonment
and loss of revenue for search engines.
Web scale: Users issue over 100 million Web queries per day. Additionally, there is an abundance of structured data [23] already available within a search engine's ecosystem from sources like crawling, data feeds, business deals or proprietary information. The combination of the two makes an efficient end-to-end solution non-trivial.
Figure 7.1: Integrating structured data into Web search
Free-text queries: Web users express queries in unstructured free-form text without knowledge of schema or available databases. To produce meaningful results, query keywords should be mapped to structure.
For example, consider the query “50 inch LG lcd tv” and assume that there exists a table with information on TVs. One way to handle such a query would be to treat each product as a bag of words and apply standard Information Retrieval techniques. However, assume that LG does not make 50 inch lcd tvs – there is a 46 inch and a 55 inch lcd tv model. Simple keyword search would retrieve nothing. On the other hand, consider a structured query that targets table “TVs” and specifies attributes Diagonal = “50 inch”, Brand = “LG”, TV Type = “lcd tv”. Now, the retrieval and ranking system can handle this query with a range predicate on Diagonal and a fast selection on the other attributes.
Intent disambiguation: Web users seek information and issue queries oblivious to the existence of structured data sources, let alone their schema and their arrangement. A mechanism that directly maps keywords to structure can lead to misinterpretations of the user's intent for a large class of queries. There are two possible types of misinterpretations: between Web versus structured data, and between individual structured tables.
For example, consider the query “white tiger” and assume there is a table available containing Shoes and one containing Books. For “white tiger”, a potential mapping can be Table = “Shoes” and attributes Color = “white” and Shoe Line = “tiger”, after the popular Asics Tiger line. A different potential mapping can be Table = “Books” and Title = “white tiger”, after the popular book. Although both mappings are possible, it seems that the book is more applicable in this scenario.
On the other hand, it is also quite possible the user was asking for information that is not contained in our collection of available structured data, for example about “white tiger”, the animal. In such a case, presenting query results
with structured information about either books or shoes would be detrimental to user experience. Hence, although
multiple structured mappings can be feasible, it is important to determine which ones are at all meaningful. Such
information can greatly benefit overall result quality.
To address these challenges, we exploit latent structured semantics in Web queries to create mappings to structured data tables and attributes. We call such mappings Structured Annotations. For example, an annotation for the query “50 inch LG lcd tv” specifies the Table = “TVs” and the attributes Diagonal = “50 inch”, Brand = “LG”, TV Type = “lcd tv”.
However, as we have already demonstrated with the query “white tiger”, generating all possible annotations is not sufficient. We need to estimate the plausibility of each annotation and determine the one that most likely captures the intent of the user. To handle such problems we designed a principled probabilistic model that scores each possible structured annotation. In addition, it also computes a score for the possibility of the query targeting information outside the structured data collection. The latter score acts as a dynamic threshold mechanism used to expose annotations that correspond to misinterpretations of the user intent.
The result is a Query Annotator component, shown in Figure 7.2. It is worth clarifying that we are not solving the end-to-end problem of including structured data in response to Web queries. That would include other components such as indexing, data retrieval, ranking and presentation. Our Query Annotator component sits on the front end of such an end-to-end system. Its output can be utilized to route queries to appropriate tables and feed annotation scores to a structured data ranker.
[Figure 7.2 diagram: the query “50" LG lcd” is processed online by the Tagger, which produces candidate annotations (A1, A2, ...); the Scorer then uses statistics learned offline from the query log and the data tables to output scored, plausible annotations (e.g., A1: 0.92).]
Figure 7.2: Overview of Query Annotator
Our contributions with respect to the challenges of integrating structured data into Web search are as follows.
1. Web speed: We design an efficient tokenizer and tagger mechanism producing annotations in milliseconds.
2. Web scale: We map the problem to a decomposable closed-world summary of the structured data that can be computed in parallel for each structured table.
3. Free-text queries: We define the novel notion of a Structured Annotation capturing structure from free
text. We show how to implement a process producing all annotations given a closed structured data world.
4. Intent disambiguation: We describe a scoring mechanism that sorts annotations based on plausibility.
Furthermore, we extend the scoring with a dynamic threshold, derived from the probability a query was not
described by our closed world.
Part of the work presented in this Chapter also appears in [134] and is organized as follows: We describe the closed structured world and Structured Annotations in Section 7.2. We discuss the efficient tokenizer and tagger process that deterministically produces all annotations in Section 7.3. We define a principled probabilistic generative model used for scoring the annotations in Section 7.4 and we discuss unsupervised model parameter learning in Section 7.5. We performed a thorough experimental evaluation with promising results, presented in Section 7.6. We conclude the chapter with a discussion of existing work in Section 7.7 and some closing comments in Section 7.8.
7.2 Structured Annotations
We assume that our application maintains a Web page collection D and a collection of structured data tables T = {T1, T2, . . . , Tτ}.¹ A table T is a set of related entities sharing a set of attributes. We denote the attributes of table T as T.A = {T.A1, T.A2, . . . , T.Aα}. Attributes can be either categorical or numerical. The domain of a categorical attribute T.Ac, i.e., the set of possible values that T.Ac can take, is denoted with T.Ac.V. We assume that each numerical attribute T.An is associated with a single unit U of measurement. Given a set of units U, we define Num(U) to be the set of all tokens that consist of a numerical value followed by a unit in U. Hence, the domain of a numerical attribute T.An is Num(T.An.U), and the domain of all numerical attributes in a table is the union of Num(T.An.U) over its numerical attributes T.An.
An example of two tables is shown in Figure 7.3. The first table contains TVs and the second Monitors. They both have three attributes: Type, Brand and Diagonal. Type and Brand are categorical, whereas Diagonal is numerical. The domain of values for all categorical attributes for both tables is T.Ac.V = {TV, Samsung, Sony, LG, Monitor, Dell, HP}. The domain for the numerical attributes for both tables is Num(T.An.U) = Num(inch). Note that Num(inch) does not include only the values that appear in the tables of the example, but rather all possible numbers followed by the unit “inch”. Additionally, note that it is possible to extend the domains with synonyms, e.g., by using “in” for “inches” and “Hewlett Packard” for “HP”. Discovery of synonyms is beyond the scope of this chapter, but existing techniques [112] can be leveraged.
We now give the following definitions.
¹The organization of data into tables is purely conceptual and orthogonal to the underlying storage layer: the data can be physically stored in XML files, relational tables, retrieved from remote Web services, etc. Our assumption is that a mapping between the storage layer and the “schema” of table collection T has been defined.
Definition 7.1 (Token). A token is defined as a sequence of characters including space, i.e., one or more words.
For example, the bigram “digital camera” may be a single token.
Definition 7.2 (Open Language Model). We define the Open Language Model (OLM) as the infinite set of all possible tokens. All keyword Web queries can be expressed using tokens from OLM.
Definition 7.3 (Typed Token). A typed token t for table T is any value from the domain T.Ac.V ∪ Num(T.An.U).
Definition 7.4 (Closed Language Model). The Closed Language Model CLM of table T is the set of all duplicate-free typed tokens for table T.
For the rest of the chapter, for simplicity, we often refer to typed tokens as just tokens. The closed language model CLM(T) contains the duplicate-free set of all tokens associated with table T. Since for numerical attributes we only store the “units” associated with Num(U), the representation of CLM(T) is very compact.
The closed language model CLM(T) for all our structured data T is defined as the union of the closed language models of all tables. Furthermore, by definition, if we break a collection of tables T into k sub-collections T1, ..., Tk, then CLM(T) can be decomposed into CLM(T1), ..., CLM(Tk). In practice, CLM(T) is used to identify tokens in a query that appear in the tables of our collection. So compactness and decomposability are very important features that address the Web speed and Web scale challenges.
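Under the definitions above, a closed language model and its decomposability can be sketched as follows. The table encoding (dicts of categorical value lists plus a list of units) and the function names are our own illustration, not the thesis implementation.

```python
def clm(table):
    """Closed language model of one table: the duplicate-free set of
    typed tokens, with numerical attributes kept compactly as units."""
    tokens = set()
    for values in table.get("categorical", {}).values():
        tokens.update(v.lower() for v in values)
    # numerical domains are represented only by their units, i.e. Num(U)
    units = set(table.get("units", []))
    return tokens, units

def clm_union(tables):
    """CLM of a collection decomposes into the union of per-table CLMs,
    so it can be built independently (and in parallel) per table."""
    tokens, units = set(), set()
    for t in tables:
        tk, un = clm(t)
        tokens |= tk
        units |= un
    return tokens, units
```

Because each per-table CLM is a plain set, sub-collections can be summarized separately and merged with set union, which is what makes the representation both compact and decomposable.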
The closed language model defines the set of tokens that are associated with a collection of tables, but it does not assign any semantics to these tokens. To this end, we define the notions of an annotated token and closed structured model.
Definition 7.5 (Annotated Token). An annotated token for a table T is a pair AT = (t, T.A) of a token t ∈ CLM(T) and an attribute T.A of table T, such that t ∈ T.A.V.
For an annotated token AT = (t, T.A), we use AT.t to refer to the underlying token t. Similarly, we use AT.T and AT.A to refer to the underlying table T and attribute A. Intuitively, the annotated token AT assigns structured semantics to a token. In the example of Figure 7.3, the annotated token (LG, TVs.Brand) denotes that the token “LG” is a possible value for the attribute TVs.Brand.
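A minimal tagger producing annotated tokens over such a representation might look like the sketch below. The table encoding is hypothetical, and the naive substring matching is only illustrative; the actual tokenizer and tagger of Section 7.3 handle token boundaries and maximality carefully.

```python
import re

def annotated_tokens(query, tables):
    """Enumerate annotated tokens (t, table, attribute) for a query.

    tables: name -> {"categorical": {attr: [values]},
                     "numerical":   {attr: unit}}
    Numerical typed tokens are matched as <number> <unit>, i.e. Num(U).
    Matching is naive substring containment, for illustration only.
    """
    out = []
    q = query.lower()
    for name, t in tables.items():
        for attr, values in t.get("categorical", {}).items():
            for v in values:
                if v.lower() in q:
                    out.append((v.lower(), name, attr))
        for attr, unit in t.get("numerical", {}).items():
            for m in re.finditer(r"\d+(?:\.\d+)?\s*" + re.escape(unit), q):
                out.append((m.group(), name, attr))
    return out
```

On the running example “50 inch LG lcd tv”, this produces the annotated tokens (50 inch, TVs.Diagonal), (lg, TVs.Brand) and (lcd tv, TVs.Type).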
The ratio λ/µ controls the confidence we place in the unigram model, versus the possibility that the free tokens come from the background distribution. Given the importance and potentially deleterious effect of free tokens on the probability and plausibility of an annotation, we would like to exert additional control on how free tokens affect the overall probability of an annotation. In order to do so, we introduce a tuning parameter 0 < φ ≤ 1, which can be used to additionally “penalize” the presence of free tokens in an annotation. To this end, we compute:

P(w|T) = φ (λ P(w|UMT) + µ P(w|OLM))
Intuitively, we can view φ as the effect of a process that outputs free tokens with probability zero (or asymptotically close to zero), which is activated with probability 1 − φ. We set the ratio λ/µ and the penalty parameter φ in our experimental evaluation in Section 7.6.
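As a worked illustration of the formula above: the normalization λ + µ = 1 is our own assumption (a standard mixture convention; the thesis only fixes the ratio λ/µ = 10 in Section 7.6), and the function name is hypothetical.

```python
def free_token_prob(p_um, p_olm, lam=10/11, mu=1/11, phi=0.1):
    """P(w|T) = phi * (lambda * P(w|UM_T) + mu * P(w|OLM)).

    p_um:  unigram probability of w under the table language model UM_T
    p_olm: background (open language model) probability of w
    lam/mu encode the ratio lambda/mu = 10 used in the experiments,
    assuming lambda + mu = 1; phi penalizes free tokens
    (phi = 0.1 for SAQ-MED, 0.01 for SAQ-LOW).
    """
    assert abs(lam + mu - 1.0) < 1e-9, "mixture weights assumed to sum to 1"
    return phi * (lam * p_um + mu * p_olm)
```

Lowering φ from 0.1 to 0.01 scales every free-token probability down tenfold, which is exactly how SAQ-LOW becomes less tolerant of free tokens than SAQ-MED.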
7.5.2 Estimating Template Probabilities
We now focus on estimating the probability of a query targeting particular tables and attributes, i.e., estimating P(T.Ai) for an annotation Si. A parallel challenge is the estimation of P(OLM), i.e., the probability of a query being generated by the open language model, since this is considered as an additional type of “table” with a single attribute that generates free tokens. We will refer to table and attribute combinations as attribute templates.
The most reasonable source of information for estimating these probabilities is Web query log data, i.e., user-issued Web queries that have already been witnessed. Let Q be such a collection of witnessed Web queries. Based on our assumptions, these queries are the output of |Q| “runs” of the generative process depicted in Figure 7.5(b). The unknown parameters of a probabilistic generative process are typically computed using maximum likelihood estimation, that is, estimating attribute template probability values P(T.Ai) and P(OLM) that maximize the likelihood of the generative process giving birth to query collection Q.
Consider a keyword query q ∈ Q and its annotations Sq. The query can either be the formulation of a request for structured data captured by an annotation Si ∈ Sq, or a free-text query described by the SOLM annotation. Since these possibilities are disjoint, the probability of the generative process outputting query q is:

P(q) = Σ_{Si∈Sq} P(Si) + P(SOLM) = Σ_{Si∈Sq} P(ATi, FTi | T.Ai) × P(T.Ai) + P(FTq | OLM) P(OLM)
A more general way of expressing P(q) is by assuming that all tables in the database and all possible combinations of attributes from these tables could give birth to query q and, hence, contribute to probability P(q). The combinations that do not appear in annotation set Sq will have zero contribution. Formally, let Ti be a table, and let Pi denote the set of all possible combinations of attributes of Ti, including the free token emitting attribute Ti.f. Then, for a table collection T of size |T|, we can write:
P(q) = Σ_{i=1}^{|T|} Σ_{Aj∈Pi} αqij πij + βq πo
where αqij = P(ATij, FTij | Ti.Aj), βq = P(FTq | OLM), πij = P(Ti.Aj) and πo = P(OLM). Note that for annotations Sij ∉ Sq, we have αqij = 0. For a given query q, the parameters αqij and βq can be computed as described in Section 7.5.1. The parameters πij and πo correspond to the unknown attribute template probabilities we need to estimate.
Therefore, the log-likelihood of the entire query log can be expressed as follows:

L(Q) = Σ_{q∈Q} log P(q) = Σ_{q∈Q} log ( Σ_{i=1}^{|T|} Σ_{Aj∈Pi} αqij πij + βq πo )
Maximization of L(Q) results in the following problem:

max_{πij, πo} L(Q), subject to Σ_{ij} πij + πo = 1    (7.6)
The condition Σ_{ij} πij + πo = 1 follows from the fact that, based on our generative model, all queries can be explained either by an annotation over the structured data tables, or as free-text queries generated by the open-world language model.
This is a large optimization problem with millions of variables. Fortunately, the objective function L(πij, πo|Q) is concave. This follows from the fact that the logarithm of a linear function is concave, and a sum of concave functions remains concave. Therefore, any optimization algorithm will converge to a global maximum.
A simple, efficient optimization algorithm is the Expectation-Maximization (EM) algorithm [24].
Lemma 7.1. The constrained optimization problem described by equation (7.6) can be solved using the Expectation-Maximization algorithm. For every keyword query q and variable πij, we introduce auxiliary variables γqij and δq. The algorithm's iterations are provided by the following formulas:

• E-Step: γ^{t+1}_{qij} = αqij π^t_{ij} / (Σ_{km} αqkm π^t_{km} + βq π^t_o)

  δ^{t+1}_q = βq π^t_o / (Σ_{km} αqkm π^t_{km} + βq π^t_o)

• M-Step: π^{t+1}_{ij} = Σ_q γ^{t+1}_{qij} / |Q|

  π^{t+1}_o = Σ_q δ^{t+1}_q / |Q|
Proof. For a related proof, see [24].
The EM algorithm's iterations are extremely lightweight and progressively improve the estimates for the variables πij, πo.
More intuitively, the algorithm works as follows. The E-step uses the current estimates of πij, πo to compute for each query q the probabilities P(Sij), Sij ∈ Sq and P(SOLM). Note that for a given query we only consider annotations in the set Sq. The appearance of each query q is “attributed” among the annotations Sij ∈ Sq and SOLM proportionally to their probabilities, i.e., γqij stands for the “fraction” of query q resulting from annotation Sij involving table Ti and attributes Ti.Aj. The M-step then estimates πij = P(Ti.Aj) as the sum of query “fractions” associated with table Ti and attribute set Ti.Aj, over the total number of queries in Q.
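The E-step and M-step formulas above translate almost directly into code. The sketch below is our own illustration: the input encoding (one dict of non-zero αqij per query, one βq per query), the function name and the dense-dict representation are assumptions, not the thesis implementation.

```python
def em_attribute_templates(alpha, beta, n_iters=50):
    """EM iterations for the attribute-template probabilities pi_ij, pi_o.

    alpha: one dict per query q, mapping template ids (i, j) to
           alpha_qij = P(AT_ij, FT_ij | Ti.Aj) for annotations in Sq
           (templates absent from the dict contribute zero).
    beta:  one value per query, beta_q = P(FT_q | OLM).
    Returns (pi, pi_o) with sum(pi.values()) + pi_o == 1.
    """
    templates = {ij for a in alpha for ij in a}
    n = len(templates) + 1
    pi = {ij: 1.0 / n for ij in templates}      # uniform initialization
    pi_o = 1.0 / n
    for _ in range(n_iters):
        gamma_sum = {ij: 0.0 for ij in templates}
        delta_sum = 0.0
        for a, b in zip(alpha, beta):
            # E-step: attribute query q among its annotations and the OLM
            z = sum(a_ij * pi[ij] for ij, a_ij in a.items()) + b * pi_o
            for ij, a_ij in a.items():
                gamma_sum[ij] += a_ij * pi[ij] / z
            delta_sum += b * pi_o / z
        # M-step: average the responsibilities over all |Q| queries
        pi = {ij: g / len(alpha) for ij, g in gamma_sum.items()}
        pi_o = delta_sum / len(alpha)
    return pi, pi_o
```

Each iteration is a single pass over the query log, touching only the non-zero αqij entries, which matches the "extremely lightweight" characterization above.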
7.6 Experimental Evaluation
We implemented our proposed Query Annotator solution in C#. We performed a large-scale experimental evaluation utilizing real data to validate our ability to successfully address the challenges discussed in Section 7.1.
The structured data collection T used was comprised of 1176 structured tables available to us from the Bing search engine. In total, there were around 30 million structured data tuples occupying approximately 400GB on disk when stored in a database. The same structured data are publicly available via an XML API.3
The tables used represent a wide spectrum of entities, such as Shoes, Video Games, Home Appliances, Televisions, and Digital Cameras. We also used tables with “secondary” complementary entities, such as Camera Lenses or Camera Accessories, that have high vocabulary overlap with “primary” entities in table Digital Cameras. This way we stress-test result quality on annotations that are semantically different but have very high token overlap.
Besides the structured data collection, we also used logs of Web queries posed on the Bing search engine. For our detailed quality experiments we used a log comprised of 38M distinct queries, aggregated over a period of 5 months.
7.6.1 Algorithms
The annotation generation component presented in Section 7.3 is guaranteed to produce all maximal annotations.
Therefore, we only test its performance as part of our scalability tests presented in Section 7.6.5. We compare the
annotation scoring mechanism against a greedy alternative. Both algorithms score the same set of annotations,
output by the annotation generation component (Section 7.3).
3See http://shopping.msn.com/xml/v1/getresults.aspx?text=televisions for a table of TVs and http://shopping.msn.com/xml/v1/getspecs.aspx?-itemid=1202956773 for an example of TV attributes.
Annotator SAQ: The SAQ annotator (Structured Annotator of Queries) stands for the full solution introduced in this work. Two sets of parameters affecting SAQ's behavior were identified. The first is the threshold parameter θ used to determine the set of plausible structured annotations, satisfying P(Si)/P(SOLM) > θ (Section 7.4). Higher threshold values render the scorer more conservative in outputting annotations, hence usually resulting in higher precision. The second is the language model parameters: the ratio λ/µ that balances our confidence in the unigram table language model versus the background open language model, and the penalty parameter φ. We fix λ/µ = 10, which we found to be a ratio that works well in practice and captures our intuition for the confidence we have in the table language model. We consider two variations of SAQ based on the value of φ: SAQ-MED (medium tolerance to free tokens) using φ = 0.1, and SAQ-LOW (low tolerance to free tokens) using φ = 0.01.
Annotator IG-X: The Intelligent Greedy (IG-X) annotator scores annotations Si based on the number of annotated tokens |ATi| that they contain, i.e., Score(Si) = |ATi|. The Intelligent Greedy annotator captures the intuition that higher scores should be assigned to annotations that interpret structurally a larger part of the query. Besides scoring, the annotator needs to deploy a threshold, i.e., a criterion for eliminating meaningless annotations and identifying the plausible ones. The plausible annotations determined by the Intelligent Greedy annotator are those satisfying (i) |FTi| ≤ X, (ii) |ATi| ≥ 2 and (iii) P(ATi|T.Ai) > 0. Condition (i) puts an upper bound X on the number of free tokens a plausible annotation should contain: an annotation with more than X free tokens cannot be plausible. Note that the annotator completely ignores the affinity of the free tokens to the annotated tokens and only reasons based on their number. Condition (ii) demands a minimum of two annotated tokens, in order to eliminate spurious annotations. Finally, condition (iii) requires that the attribute-value combination identified by an annotation has a non-zero probability of occurring. This eliminates combinations of attribute values that have zero probability according to the multi-attribute statistics we maintain (Section 7.5.1).
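The IG-X score and its three plausibility conditions can be sketched directly. The annotation record shape used here ('at' for annotated tokens, 'ft' for free tokens, 'p_at' for P(ATi|T.Ai)) is hypothetical, chosen only to make the conditions concrete.

```python
def ig_x_plausible(annotations, X):
    """IG-X baseline: score an annotation by its number of annotated
    tokens, and keep only annotations satisfying
    (i)   at most X free tokens,
    (ii)  at least 2 annotated tokens, and
    (iii) non-zero probability of the attribute-value combination.
    Returns the plausible annotations, highest score first.
    """
    plausible = [a for a in annotations
                 if len(a["ft"]) <= X        # condition (i)
                 and len(a["at"]) >= 2       # condition (ii)
                 and a["p_at"] > 0]          # condition (iii)
    return sorted(plausible, key=lambda a: len(a["at"]), reverse=True)
```

Note that the filter never looks at which free tokens appear, only at how many, which is exactly the weakness the experiments below expose for X > 0.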
7.6.2 Scoring Quality
We quantify annotation scoring quality using precision and recall. This requires obtaining labels for a set of queries and their corresponding annotations. Since manual labeling could not be realistically done on the entire structured data and query collections, we focused on 7 tables: Digital Cameras, Camcorders, Hard Drives, Digital Camera Lenses, Digital Camera Accessories, Monitors and TVs. The particular tables were selected because of their high popularity, and also the challenge that they pose to the annotators due to the high overlap of their corresponding closed language models (CLM). For example, tables TVs and Monitors or Digital Cameras and Digital Camera Lenses have very similar attributes and values.
The ground truth query set, denoted Q, consists of 50K queries explicitly targeting the 7 tables. The queries were identified using relevant click log information over the structured data and the query-table pair validity was
manually verified. We then used our tagging process to produce all possible maximal annotations and labeled
manually the correct ones, if any.
We now discuss the metrics used for measuring the effectiveness of our algorithms. An annotator can output multiple plausible structured annotations per keyword query. We define 0 ≤ TP(q) ≤ 1 as the fraction of correct plausible structured annotations over the total number of plausible structured annotations identified by an annotator. We also define a keyword query as covered by an annotator, if the annotator outputs at least one plausible annotation. Let also Cov(Q) denote the set of queries covered by an annotator. Then, we define:
Precision = Σ_{q∈Q} TP(q) / |Cov(Q)|,   Recall = Σ_{q∈Q} TP(q) / |Q|
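These two metrics can be computed directly from the per-query TP(q) fractions; a small sketch follows (the data shapes and function name are our own, and TP(q) is 0 for uncovered queries, so summing over covered queries is equivalent to summing over Q).

```python
def precision_recall(tp, covered, total_queries):
    """Precision and recall from per-query TP(q) fractions.

    tp:      dict query -> fraction of correct plausible annotations
             among the plausible annotations output for that query.
    covered: set of queries for which the annotator output at least
             one plausible annotation (TP is 0 for the rest).
    """
    s = sum(tp.get(q, 0.0) for q in covered)
    precision = s / len(covered) if covered else 0.0
    recall = s / total_queries if total_queries else 0.0
    return precision, recall
```

Precision divides by the covered queries only, while recall divides by all |Q| queries, so an annotator that answers rarely but correctly scores high precision and low recall.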
Figure 7.6 presents the Precision vs Recall plot for SAQ-MED, SAQ-LOW and the IG-X algorithms. Threshold θ values for SAQ were in the range 0.001 ≤ θ ≤ 1000. Each point in the plot corresponds to a different θ value. The SAQ-based annotators and IG-0 achieve very high precision, with SAQ being a little better. To some extent this is to be expected, given that these are “cleaner” queries, with every single query pre-classified to target the structured data collection. Therefore, an annotator is less likely to misinterpret open-world queries as a request for structured data. Notice, however, that the recall of the SAQ-based annotators is significantly higher than that of IG-0. The IG-X annotators achieve similar recall for X > 0, but the precision degrades significantly. Note also that increasing the allowable free tokens from 1 to 5 does not give gains in recall, but causes a large drop in precision. This is expected since targeted queries are unlikely to contain many free tokens.
Figure 7.6: Precision and Recall using Targeted Queries
Since the query data set is focused only on the tables we consider, we decided to stress-test our approach even further: we set threshold θ = 0, effectively removing the adaptable threshold separating plausible and implausible annotations, and considered only the most probable annotation. SAQ-MED precision was measured at 78% and
recall at 69% for θ = 0, versus precision 95% and recall 40% for θ = 1. This highlights the following points.
First, even queries targeting the structured data collection can have errors and the adaptive threshold based on
the open-language model can help precision dramatically. Note that errors in this case happen by misinterpreting
queries amongst tables or the attributes within a table, as there are no generic Web queries in this labeled data
set. Second, there is room for improving recall significantly. A query is often not annotated due to issues with
stemming, spell-checking or missing synonyms. For example, we do not annotate token “cannon” when it is used
instead of “canon”, or “hp” when used instead of “hewlett-packard”. An extended structured data collection using
techniques as in [38, 41] can result in significantly improved recall. Finally, we measured that in approximately
19% of the labeled queries, not a single token relevant to the considered table attributes was used in the query. This
means there was no possible mapping from the open language used in Web queries to the closed world described
by the available structured data.
7.6.3 Handling General Web Queries
Having established that the proposed solution performs well in a controlled environment where queries are known
to target the structured data collection, we now investigate its quality on general Web queries. We use the full log of 38M queries, representative of an everyday Web search engine workload. These queries vary a lot in context and are easy to misinterpret, essentially stress-testing the annotator's ability to suppress false positives.
We consider the same annotator variants: SAQ-MED, SAQ-LOW and IG-X. For each query, the algorithms
output a set of plausible annotations. For each alternative, a uniform random sample of covered queries was
retrieved and the annotations were manually labeled by 3 judges. A different sample for each alternative was
used; 450 queries for each of the SAQ variations and 150 queries for each of the IG variations. In total, 1350
queries were thoroughly hand-labeled. Again, to minimize the labeling effort, we only consider structured data
from the same 7 tables mentioned earlier.
The plausible structured annotations associated with each query were labeled as Correct or Incorrect based on whether an annotation was judged to represent a highly likely interpretation of the query over our collection of tables T. We measure precision as:
Precision = (# of correct plausible annotations in the sample) / (# of plausible annotations in the sample)
It is not meaningful to compute recall on the entire query set of 38 million. The vast majority of the Web queries are general purpose queries and do not target the structured data collection. To compensate, we measured coverage, defined as the number of covered queries, as a proxy of relative recall.
Figure 7.7 presents the annotation precision-coverage plot, for different threshold values. SAQ uses threshold values ranging in 1 ≤ θ ≤ 1000. Many interesting trends emerge from Figure 7.7. With respect to SAQ-MED
Figure 7.7: Precision and Coverage using General Web Queries
and SAQ-LOW, the annotation precision achieved is extremely high, ranging from 0.73 to 0.89 for SAQ-MED and 0.86 to 0.97 for SAQ-LOW. Expectedly, SAQ-LOW's precision is higher than SAQ-MED's, as SAQ-MED is more tolerant towards the presence of free tokens in a structured annotation. As discussed, free tokens have the potential to completely distort the interpretation of the remainder of the query. Hence, by being more tolerant, SAQ-MED misinterprets queries that contain free tokens more frequently than SAQ-LOW. Additionally, the effect of the threshold on precision is pronounced for both variations: a higher threshold value results in higher precision.
The annotation precision of IG-1 and IG-5 is extremely low, demonstrating the challenge that free tokens introduce and the value of treating them appropriately. Even a single free token (IG-1) can have a deleterious effect on precision. However, even IG-0, which only outputs annotations with zero free tokens, offers lower precision than the SAQ variations. The IG-0 algorithm, by not reasoning in a probabilistic manner, makes a variety of mistakes, the most important of which is to erroneously identify latent structured semantics in open-world queries. The “white tiger” example mentioned in Section 7.1 falls in this category. To verify this claim, we collected and labeled a sample of 150 additional structured annotations that were output by IG-0, but rejected by SAQ-MED with θ = 1. SAQ's decision was correct approximately 90% of the time.
With respect to coverage, as expected, the more conservative variations of SAQ, which demonstrated higher
precision, have lower coverage values. SAQ-MED offers higher coverage than SAQ-LOW, while increased thresh-
old values result in reduced coverage. Note also the very poor coverage of IG-0. SAQ, by allowing and properly
handling free tokens, substantially increases coverage without sacrificing precision.
7.6.4 Understanding Annotation Pitfalls
We performed micro-benchmarks using the hand-labeled data described in Section 7.6.3 to better understand when
the annotator works well and when it does not. We looked at the effect of annotation length, free tokens and
structured data overlap.
Number of free tokens
Figures 7.8(a) and 7.9(a) depict the fraction of correct and incorrect plausible structured annotations with respect
to the number of free tokens, for configurations SAQ-LOW (with θ = 1) and IG-5 respectively. For instance, the
second bar of Figure 7.8(a) shows that 35% of all plausible annotations contain 1 free token: 24% were correct, and 11%
were incorrect. Figures 7.8(b) and 7.9(b) normalize these fractions for each number of free tokens. For instance,
the second bar of Figure 7.8(b) signifies that of the structured annotations with 1 free token output by SAQ-LOW,
approximately 69% were correct and 31% were incorrect.
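The per-bucket normalization described above is a simple ratio; the sketch below reproduces the example numbers for the 1-free-token bucket (the function name is ours):

```python
def normalize_bucket(correct_frac, incorrect_frac):
    """Normalize the correct/incorrect fractions within one free-token bucket,
    so the two normalized values sum to 1."""
    total = correct_frac + incorrect_frac
    return correct_frac / total, incorrect_frac / total

# Bucket of annotations with 1 free token: 24% of all plausible annotations
# were correct and 11% incorrect (35% of the total).
corr, incorr = normalize_bucket(0.24, 0.11)
print(round(corr, 2), round(incorr, 2))  # 0.69 0.31
```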
The bulk of the structured annotations output by SAQ-LOW (Figure 7.8) contain either none or one free token.
As the number of free tokens increases, it becomes less likely that a candidate structured annotation is correct.
SAQ-LOW penalizes a large number of free tokens and only outputs structured annotations if it is confident of their
correctness. On the other hand, for IG-5 (Figure 7.9), more than 50% of structured annotations contain at least 2
free tokens. By using the appropriate probabilistic reasoning and dynamic threshold, SAQ-LOW achieves higher
precision even against IG-0 (zero free tokens) or IG-1 (zero or one free tokens). As we can see, SAQ handles the
entire gamut of free-token presence gracefully.
Figure 7.8: SAQ-LOW: Free tokens and precision.
Overall annotation length
Figures 7.10 and 7.11 present the fraction and normalized fraction of correct and incorrect structured annotations
output, with respect to annotation length. The length of an annotation is defined as the number of the annotated
and free tokens. Note that Figure 7.11 presents results for IG-0 rather than IG-5. Having established the effect
of free tokens with IG-5, we wanted a comparison that focuses more on annotated tokens, so we chose IG-0, which
outputs zero free tokens.
Figure 7.9: IG-5: Free tokens and precision.
An interesting observation in Figure 7.10(a) is that although SAQ-LOW has not been constrained like IG-0 to
output structured annotations containing at least 2 annotated tokens, only a tiny fraction of its output annotations
contain a single annotated token. Intuitively, it is extremely hard to confidently interpret a token, corresponding
to a single attribute value, as a structured query. Most likely the keyword query is an open-world query that was
misinterpreted.
The bulk of mistakes by IG-0 happen for two-token annotations. As the number of tokens increases, it
becomes increasingly unlikely that all 3 or 4 annotated tokens from the same table appeared in the same query
by chance. Finally, note how different the distribution of structured annotations is with respect to the length of
SAQ-LOW (Figure 7.10(a)) and IG-0 (Figure 7.11(a)). By allowing free tokens in a structured annotation, SAQ
can successfully and correctly annotate longer queries, hence achieving much better recall without sacrificing
precision.
Figure 7.10: SAQ-LOW: Annotation length and precision.
Figure 7.11: IG-0: Annotation length and precision.
Types of free tokens in incorrect annotations
Free tokens can completely invalidate the interpretation of a keyword query captured by the corresponding structured
annotation. Figure 7.12 depicts a categorization of the free tokens present in plausible annotations output by
SAQ and labeled as incorrect. The goal of the experiment is to understand the source of the errors in our approach.
We distinguish four categories of free tokens: (i) Open-world altering tokens: This includes free tokens such
as "review", "drivers" that invalidate the intent behind a structured annotation and take us outside the closed
world. (ii) Closed-world altering tokens: This includes relevant tokens that are not annotated due to incomplete
structured data and eventually lead to misinterpretations. For example, token "slr" is not annotated in the query
"nikon 35 mm slr" and as a result the annotation for Camera Lenses receives a high score. (iii) Incomplete closed-
world: This includes tokens that would have been annotated if synonyms and spell checking were enabled. For
example, query "panasonic video camera" gets misinterpreted if "video" is a free token. If "video camera" were
given as a synonym of "camcorder", this would not be the case. (iv) Open-world tokens: This contains mostly
stop-words like "with", "for", etc.
The majority of errors are in category (i). We note that a large fraction of these errors could be corrected
by a small amount of supervised effort to identify common open-world altering tokens. We observe also that
the number of errors in categories (ii) and (iii) is lower for SAQ-LOW than SAQ-MED, since (a) SAQ-LOW is
more stringent in filtering annotations and (b) it down-weights the effect of free tokens and is thus hurt less by not
detecting synonyms.
Overlap on structured data
High vocabulary overlap between tables introduces a potential source of error. Table 7.1 presents a “confusion
matrix” for SAQ-LOW. Every plausible annotation in the sample is associated with two tables: the actual table
targeted by the corresponding keyword query (“row” table) and the table that the structured annotation suggests
as targeted ("column" table). Table 7.1 displays the row-normalized fraction of plausible annotations output for
each actual-predicted table pair.

Figure 7.12: Free tokens in incorrect annotations.

Actual ↓ \ Predicted →   Cameras   Camcorders   Lenses   Accessories   OLM
Cameras                    92%        2%          4%        2%         0%
Camcorders                  4%       96%          0%        0%         0%
Lenses                      2%        0%         94%        4%         0%
Accessories                13%        3%          3%       81%         0%
OLM                         7%        2%          0%        1%        90%

Table 7.1: Confusion matrix for SAQ-LOW.

For instance, for 4% of the queries relevant to table Camcorders, the plausible
structured annotation identified table Digital Cameras instead. We note that most of the mass is on the diagonal,
indicating that SAQ correctly determines the table and avoids class confusion. The biggest error occurs on camera
accessories, where failure to understand free tokens (e.g., "batteries" in query "nikon d40 camera batteries") can
result in producing high-score annotations for the Cameras table.
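Row normalization of a confusion matrix like Table 7.1 divides each (actual, predicted) count by the actual row's total. A minimal sketch, using invented counts (the real per-pair counts are not reported in the text):

```python
def row_normalize(counts):
    """Convert raw (actual -> predicted -> count) tallies into
    row-normalized fractions, as in Table 7.1."""
    normalized = {}
    for actual, row in counts.items():
        total = sum(row.values())
        normalized[actual] = {pred: c / total for pred, c in row.items()}
    return normalized

# Hypothetical counts for two of the tables.
counts = {
    "Cameras":    {"Cameras": 46, "Camcorders": 4},
    "Camcorders": {"Cameras": 2,  "Camcorders": 48},
}
norm = row_normalize(counts)
print(norm["Camcorders"]["Camcorders"])  # 0.96
```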
7.6.5 Efficiency of Annotation Process
We performed an experiment to measure the total time required by SAQ to generate and score annotations for
the queries of our full Web log. The number of tables was varied in order to quantify the effect of increasing
table collection size on annotation efficiency. The experimental results are depicted in Figure 7.13. The figure
presents the mean time required to annotate a query: approximately 1 millisecond is needed to annotate a keyword
query in the presence of 1176 structured data tables. Evidently, the additional overhead to general search-engine
query processing is minuscule, even in the presence of a large structured data collection. We also observe a
linear increase of annotation latency with respect to the number of tables. This can be attributed to the number
of structured annotations generated and considered by SAQ increasing at worst linearly with the number of
tables.
The experiment was executed on a single server and the closed structured model for all 1176 tables required
10GB of memory. It is worth noting that our solution is decomposable, ensuring high parallelism. Therefore,
besides low latency that is crucial for Web search, a production system can afford to use multiple machines to
achieve high query throughput. For example, based on a latency of 1ms per query, 3 machines would suffice for
handling a hypothetical Web search-engine workload of 250M queries per day.
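The machine-count estimate can be verified with a quick back-of-the-envelope calculation; only the 1 ms latency and the 250M-queries-per-day workload come from the text, the rest is arithmetic:

```python
import math

latency_s = 0.001        # mean annotation latency: 1 ms per query
queries_per_day = 250e6  # hypothetical Web search-engine workload
seconds_per_day = 24 * 3600

# Total seconds of annotation work per day, assuming one query at a time
# per machine; divide by the seconds available in a day and round up.
work_seconds = queries_per_day * latency_s              # 250,000 s
machines = math.ceil(work_seconds / seconds_per_day)
print(machines)  # 3
```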
[Plot: mean annotation time per query (ms) versus number of tables, with a linear fit.]
Figure 7.13: SAQ: On-line efficiency.
7.7 Comparison to Existing Work
A problem related to generating plausible structured annotations, referred to as Web query tagging, was introduced
in [98]. Its goal is to assign each query term to a specified category, roughly corresponding to a table attribute.
A Conditional Random Field (CRF) is used to capture dependencies between query words and identify the most
likely joint assignment of words to "categories". Query tagging can be viewed as a simplification of the query
annotation problem considered in this work. One major difference is that in [98] structured data are not organized
into tables. This assumption severely restricts the applicability of the solution across multiple domains, as
there is no mechanism to disambiguate between arbitrary combinations of attributes. Second, the possibility of
not attributing a word to any specific category is not considered. This assumption is incompatible with the general
Web setting. Finally, training of the CRF is performed in a semi-supervised fashion and hence the focus of [98]
is on automatically generating and utilizing training data for learning the CRF parameters. Having said that, the
scale of the Web demands an unsupervised solution; anything less will encounter issues when applied to diverse
structured domains.
Keyword search on relational [77, 101, 88], semi-structured [67, 102] and graph data [87, 72] (Keyword Search
Over Structured Data, abbreviated as KSOSD) has been an extremely active research topic. Its goal is the efficient
retrieval of relevant database tuples, XML sub-trees or subgraphs in response to keyword queries. The problem
is challenging since the relevant pieces of information needed to assemble answers are assumed to be scattered
across relational tables, graph nodes, etc. Essentially, KSOSD techniques allow users to formulate complicated
join queries against a database using keywords. The tuples returned are ranked based on the “distance” in the
database of the fragments joined to produce a tuple, and the textual similarity of the fragments to query terms.
The assumptions, requirements and end-goal of KSOSD are radically different from the Web query annota-
tion problem that we consider. Most importantly, KSOSD solutions implicitly assume that users are aware of
the presence and nature of the underlying data collection, although perhaps not its exact schema, and that they
explicitly intend to query it. Hence, the focus is on the assembly, retrieval and ranking of relevant results (tuples).
On the contrary, Web users are oblivious to the existence of the underlying data collection and their queries might
even be irrelevant to it. Therefore, the focus of the query annotation process is on discovering latent structure in
Web queries and identifying plausible user intent. This information can subsequently be utilized for the benefit of
structured data retrieval and KSOSD techniques. For a thorough survey of the KSOSD literature and additional
references see [40].
7.8 Conclusions
Integrating structured data into Web search presents unique and formidable challenges, with respect to both result
quality and efficiency. Towards addressing such problems we defined the novel notion of Structured Annotations
as a mapping of a query to a table and its attributes. We showed an efficient process that creates all such annotations
and presented a probabilistic scorer that has the ability to sort and filter annotations based on the likelihood
they represent meaningful interpretations of the user query. The end-to-end solution is highly efficient, demonstrates
attractive precision/recall characteristics and is capable of adapting to diverse structured data collections
and query workloads in a completely unsupervised fashion.
Chapter 8
Conclusions
Increasingly, research on textual data management adopts a view of documents that is aligned with their true
complexity: extended documents comprised of both text and meta-data, and document collections that are dynamic and
integrated (Chapters 1 and 2). In this context, we presented a gamut of efficient solutions enabling sophisticated,
novel functionality for interacting with textual data; techniques that leverage the extended nature of documents
and document collections. It is our hope that the material presented in this thesis will help stimulate both new
research work, as well as inspire the development of useful real-world applications.
As a concluding attempt towards this goal, we summarize in Section 8.1 the most important – and hopefully
useful – lessons we gained while working on this thesis. In Section 8.2 we discuss the potential utility of our
ideas, techniques and algorithms in applications other than the ones they were originally meant to support. In Section 8.3
we suggest possible research directions extending the ideas presented herein.
Finally, we note that besides documenting our techniques and results, we were actively involved in the prototyping
of applications based on them. Grapevine1 [9] is an on-line application that allows its users to interactively
explore stories capturing attention in social media. At the time of writing the system processes 2.5 million blog
posts daily and tracks the stories being discussed across 660 thousand demographic segments. The algorithms
powering the system and enabling its functionality are the ones described in Chapter 3. Additionally, the Web
query analysis technique presented in Chapter 7 is incorporated into Microsoft’s Helix project2 [122], an ongoing
effort on integrating structured data sources into Web search. Similarly, we are confident that all solutions pre-
sented are equally suitable for real applications, as corroborated by the real-data experimental results presented in