Dense Subgraph Maintenance under Streaming Edge Weight Updates for Real-time Story Identification

Albert Angel, University of Toronto
Nick Koudas, University of Toronto
Nikos Sarkas, University of Toronto
Divesh Srivastava, AT&T Labs-Research

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 38th International Conference on Very Large Data Bases, August 27th - 31st 2012, Istanbul, Turkey. Proceedings of the VLDB Endowment, Vol. 5, No. 6. Copyright 2012 VLDB Endowment 2150-8097/12/02... $10.00.
ABSTRACT
Recent years have witnessed an unprecedented proliferation of social media. People around the globe author, every day, millions of blog posts, micro-blog posts, social network status updates, etc. This rich stream of information can be used to identify, on an ongoing basis, emerging stories and events that capture popular attention. Stories can be identified via groups of tightly-coupled real-world entities, namely the people, locations, products, etc., that are involved in the story. The sheer scale and rapid evolution of the data involved necessitate highly efficient techniques for identifying important stories at every point in time.

The main challenge in real-time story identification is the maintenance of dense subgraphs (corresponding to groups of tightly-coupled entities) under streaming edge weight updates (resulting from a stream of user-generated content). This is the first work to study the efficient maintenance of dense subgraphs under such streaming edge weight updates. For a wide range of definitions of density, we derive theoretical results regarding the magnitude of change that a single edge weight update can cause. Based on these, we propose a novel algorithm, DYNDENS, which outperforms adaptations of existing techniques to this setting, and yields meaningful results. Our approach is validated by a thorough experimental evaluation on large-scale real and synthetic datasets.
1. INTRODUCTION
Recent years have witnessed an unprecedented proliferation of social media. Every day, millions of people around the globe author blog posts, micro-blog posts and social network status updates. This content offers an uncensored window into current events and emerging stories capturing popular attention.
For instance, consider the U.S. military strike in Abbottabad, Pakistan in early May 2011, which resulted in the death of Osama bin Laden. This event was extensively covered on Twitter, the popular micro-blogging service, significantly in advance of traditional media, from the live coverage of the operation by an (unwitting) local witness, to millions of tweets around the world providing a multifaceted commentary on every aspect of the story. Similar, if fewer, online discussions cover important events on an everyday basis, from politics and sports to the economy and culture (notable examples from recent years range from the death of Michael Jackson, to revolutions in the Middle East and the economic recession). In all cases, stories have a strong temporal component, making timeliness a prime concern in their identification.

Figure 1: Real-time identification of “bin Laden raid” story, and connection to ENGAGEMENT
Interestingly, such stories can be identified by leveraging the real-world entities involved in them (e.g. people, politicians, products and locations) [26]. The key observation is that each post on the story will tend to mention the same set of entities, around which the story is centered. In particular, as post length restrictions or conventions typically limit the number of entities mentioned in a single post, each post will tend to mention entities corresponding to a single facet of a story. Thus, by identifying pairs of entities that are strongly associated (recurrently mentioned together), one can implicitly detect facets of the underlying event of which they are the main actors. By piecing together these facets, the overall event of interest can be inferred.
For example, in the case of the U.S. military strike mentioned above, one facet, consisting of people discussing the raid, is centered around “Abbottabad”, where the raid took place, and the involvement of the “C.I.A.”; another thread, commenting on the presidential announcement, involves “Barack Obama” and “Osama bin Laden”; and so on. The resulting overall story at some point in time involves the union of these entities. Such sets of entities can then be used by systems such as Grapevine [3] to let users explore the story interactively.
Given a measure to quantify the strength of association between two entities (such as the log-likelihood ratio [26], the χ² measure, or the correlation coefficient [5], etc.), one can abstract the real-time stream of posts as an evolving (weighted) entity graph, whose edge weights denote the pairwise entity association strength.¹ An important story can then be identified via a cohesive group of strongly associated entity pairs, i.e. a dense subgraph in the entity graph, given an appropriate definition of density. Moreover, note that, as the entities in a story need to be presented to users to facilitate navigation, story cardinality needs to be constrained to moderate sizes; after all, it would not be very interesting or helpful to present users with a story centered around 100 main entities. This process is illustrated in Figure 1.
Every post that is published results in the weight update of one or more edges in the entity graph. The high frequency of post generation, coupled with our need for timely reporting of emerging stories, necessitates that the identification of dense structures in the entity graph be highly efficient. This work thus addresses the problem of dENse subGrAph maintenance for edGE-weight update streaMs under sizE constraiNTs, or ENGAGEMENT for brevity. Besides being useful as-is for identifying stories from social media in real time, solutions to this problem can also be used as building blocks for more complex computations; e.g. identified dense subgraphs can undergo diversification before being presented to the user [2], or they can be reranked taking their external sparsity into account, in order to identify (soft) clusters of associated entities.
Addressing ENGAGEMENT at web scale presents several challenges. Principal among these is that a change in the weight of a single edge can impact the density of many subgraphs, necessitating a potentially unbounded exploration of the entity graph. Thus, any efficient solution to ENGAGEMENT needs to incrementally maintain dense subgraphs, without recomputing them from scratch. Moreover, there does not exist a single definition of graph density suitable for all scenarios; selecting the most appropriate definition for a given setting depends, for instance, on the perceived relative importance of having large, versus well-connected, dense subgraphs. Thus, solutions to ENGAGEMENT need to be applicable under general notions of density; however, existing techniques are only applicable to limited subsets of this problem.
In this context, we propose DYNDENS, an efficient algorithm for ENGAGEMENT. We theoretically quantify the magnitude of change in dense subgraphs that a single edge weight update can cause. Based on this, we show how maintaining some sparse subgraphs, in addition to dense ones, enables the incremental maintenance of dense subgraphs. The resulting algorithm, DYNDENS, makes use of an efficient index for subgraphs, which decreases memory consumption and processing effort. It is complemented by theoretically sound heuristics that can offer improved performance. A comprehensive experimental evaluation on real and synthetic data highlights the effectiveness of our approach.
To summarize, our main contributions in this work are:
i) Motivated by the need to identify emerging stories in real time, for a wide range of measures of entity association, we formalize the problem of dENse subGrAph maintenance for edGE-weight update streaMs under sizE constraiNTs (ENGAGEMENT), for a very broad notion of graph density.
ii) We propose an efficient algorithm DYNDENS, based on a novel quantification of the maximum possible change caused by a single edge weight update. By maintaining a small number of sparse subgraphs, DYNDENS is able to efficiently and incrementally compute dense subgraphs.
iii) We design an efficient dense subgraph index, which decreases memory consumption and processing effort, and propose theoretically sound heuristics for DYNDENS that can offer improved performance.
iv) We validate our techniques via a thorough experimental evaluation on both real and synthetic datasets.

¹ The association measure can also incorporate notions of recency of association, e.g. by including some form of temporal decay.
The remainder of this paper is organized as follows: After providing a formal problem statement in Section 2, we present our proposed algorithm DYNDENS in Section 3. We explore the theoretical basis for DYNDENS in Section 4, evaluate the proposed techniques in Section 5, and discuss some improvements to DYNDENS in Section 6. Finally, we review related work in Section 7, and conclude in Section 8.
2. FORMALIZATION
Let us now turn to defining ENGAGEMENT. At a high level, consider a weighted graph with a constant number of vertices. At every discrete time interval, the weights of one or more edges are adjusted (potentially including edge additions and removals). The goal is to maintain, at each point in time, all subgraphs with “density” greater than a given threshold.
Connections to real-time story identification: Before fully formalizing the problem, let us first draw some connections to its application in real-time story identification. In this context, vertices correspond to real-world entities, and edge weights to their (current) pairwise association strengths (the choice of association strength measure will depend on characteristics of the specific problem instance; in Section 5 we discuss several such choices). We assume that a procedure exists for processing streams of (entity-annotated²) posts, and generating the appropriate edge weight updates at each time interval (in Section 5 we discuss such procedures for a variety of measures of interest).
Data model: We represent the problem domain as i) a complete weighted graph $G = (V, E)$ with $N$ vertices, where $w_{ij}$ is the weight of the edge between nodes $i$ and $j$; and ii) a stream of edge weight updates of the form $update_i = (a, b, \delta)$, signifying that at time instant $i$, the weight of the edge between vertices $a$ and $b$ changed from $w_{ab}$ to $w_{ab} + \delta$.
Density: We define subgraph density as follows: for every subgraph $C \subseteq V$, its density is $dens(C) = \frac{score(C)}{S_{|C|}}$, where $score(C) = \sum_{i,j \in C \wedge i<j} w_{ij}$. $S_n$ is a function quantifying the relative importance of a subgraph’s cardinality, $n$, to its density; with the appropriate choice of $S_n$, virtually all quantifications of graph density can be represented.
Note that we do not consider counter-intuitive quantifications of graph density, such as (but not limited to) a definition of density where the removal of a vertex from an unweighted clique results in an increase of its density. To safeguard against such quantifications of density, we require that $S_n$ have the following intuitive monotonicity properties: $\frac{n}{n-1} \le \frac{S_n}{S_{n-1}} \le \frac{n}{n-2}$.³ This encompasses the full spectrum of choices of density functions commonly used in the literature; typical choices include $S_n = \frac{n(n-1)}{2}$ (thus density is defined as the average edge weight, favoring small, dense subgraphs; we term this instantiation AVGWEIGHT), and $S_n = n$ (thus density represents a generalized average node “degree”, favoring large subgraphs; we term this case AVGDEGREE).

² The precise procedure used for identifying named entities in documents, e.g. [3], is orthogonal to this work.
³ Observe that if $\frac{S_n}{S_{n-1}} > \frac{n}{n-2}$, the density of an unweighted clique will increase if vertices are removed. Moreover, observe that if $\frac{n}{n-1} > \frac{S_n}{S_{n-1}}$, in an unweighted graph the density of a 3-vertex clique $K_3$ will increase if it is augmented by a single vertex, connected with a single edge to one of the clique vertices.
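To make the definition concrete, the following minimal Python sketch evaluates dens(C) for the two instantiations named above and checks the monotonicity requirement numerically. The names `S_FUNCS`, `score`, `dens` and `satisfies_monotonicity`, as well as the dictionary-based edge representation, are illustrative assumptions and do not come from the paper.

```python
from itertools import combinations

# Size-normalisation functions S_n named in the text.
S_FUNCS = {
    "AVGWEIGHT": lambda n: n * (n - 1) / 2,  # density = average edge weight
    "AVGDEGREE": lambda n: float(n),         # density = generalized average degree
}

def score(weights, C):
    """Sum of w_ij over unordered vertex pairs of C; absent edges weigh 0."""
    return sum(weights.get(frozenset(p), 0.0) for p in combinations(sorted(C), 2))

def dens(weights, C, S):
    """dens(C) = score(C) / S_{|C|}."""
    return score(weights, C) / S(len(C))

def satisfies_monotonicity(S, n_max=50, eps=1e-12):
    """Check n/(n-1) <= S_n/S_{n-1} <= n/(n-2) for 3 <= n <= n_max."""
    return all(
        n / (n - 1) - eps <= S(n) / S(n - 1) <= n / (n - 2) + eps
        for n in range(3, n_max + 1)
    )

# Hypothetical toy graph: a 4-vertex clique with unit edge weights.
k4 = {frozenset(p): 1.0 for p in combinations(range(1, 5), 2)}
avg_weight = dens(k4, {1, 2, 3, 4}, S_FUNCS["AVGWEIGHT"])  # 6 / 6
avg_degree = dens(k4, {1, 2, 3, 4}, S_FUNCS["AVGDEGREE"])  # 6 / 4
```

On this toy clique the two instantiations disagree (1.0 versus 1.5), which illustrates how the choice of $S_n$ trades off cardinality against connectedness. A function such as $S_n = n^3$ fails the check, since its ratio at $n = 3$ exceeds $\frac{n}{n-2}$.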
Cardinality constraint: Finally, let $N_{max}$ be a (user-specified) maximum cardinality for subgraphs of interest. (In the context of real-time story identification, this constraint ensures that any subgraphs identified are small enough to be used for navigation and exploration purposes; cf. Section 1.)
ENGAGEMENT: Given the above, the goal of ENGAGEMENT is to maintain, at every point of time $i$, the subgraphs (vertex subsets) with density over a given threshold $T$, subject to cardinality constraints, i.e. $\{V_j \mid V_j \subseteq V \wedge dens(V_j) \ge T \wedge |V_j| \le N_{max}\}$. We term these output-dense subgraphs.
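As a point of reference, the problem statement can be turned directly into a brute-force recomputation that enumerates every vertex subset after each update. The sketch below is such an oracle, useful only for checking results on tiny graphs; it is exactly the from-scratch computation that an incremental method aims to avoid. The function names, the `frozenset`-keyed weight dictionary, and the restriction to subsets of at least two vertices are assumptions made for illustration.

```python
from itertools import combinations

def output_dense_subgraphs(weights, vertices, S, T, n_max):
    """Enumerate every vertex subset C with 2 <= |C| <= n_max and dens(C) >= T.

    weights maps frozenset({i, j}) -> w_ij; absent pairs have weight 0.
    This is a brute-force oracle, not the DynDens algorithm.
    """
    result = []
    for n in range(2, n_max + 1):
        for C in combinations(sorted(vertices), n):
            total = sum(weights.get(frozenset(p), 0.0) for p in combinations(C, 2))
            if total / S(n) >= T:
                result.append(frozenset(C))
    return result

def apply_update(weights, a, b, delta):
    """Apply a stream element (a, b, delta): w_ab becomes w_ab + delta."""
    key = frozenset({a, b})
    weights[key] = weights.get(key, 0.0) + delta

# Hypothetical 3-vertex graph under AvgWeight (S_n = n(n-1)/2) with T = 1.
avg_weight_S = lambda n: n * (n - 1) / 2
w = {frozenset({1, 2}): 0.8, frozenset({1, 3}): 1.0, frozenset({2, 3}): 1.1}
before = set(output_dense_subgraphs(w, {1, 2, 3}, avg_weight_S, 1.0, 3))
apply_update(w, 1, 2, 0.15)
after = set(output_dense_subgraphs(w, {1, 2, 3}, avg_weight_S, 1.0, 3))
```

The enumeration visits a number of subsets that grows combinatorially with $N$ and $N_{max}$, which is why it is only practical as a correctness check.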
Notation: Before going into the details of our proposed approach, let us introduce some useful notation.
We denote each vertex by a natural number, so $V = \{1, \cdots, N\}$ denotes the set of vertices in $G$.
Let $e_i$ be the $i$’th basis vector (an $N$-dimensional vector, with value 1 in its $i$’th coordinate, and 0 elsewhere). We will denote a subset $C \subseteq V$ by its corresponding vector $\vec{c} = \sum_{i \in C} e_i$, and will sometimes refer to either interchangeably; we will also on occasion denote the cardinality of subset $C$ as $|\vec{c}|$.
Let $\vec{\Gamma}_u$ be the neighborhood vector of vertex $u$: $\vec{\Gamma}_u = (w_{1u}, w_{2u}, \cdots, w_{Nu})$.
For convenience, we will also make use of the following normalized version of $S_n$: Let $g_n = \frac{S_n}{n \cdot (n-1)}$. By the monotonicity properties of $S_n$, it follows that $g_n \le g_{n-1}$.
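The inequality $g_n \le g_{n-1}$ uses only the upper monotonicity bound $\frac{S_n}{S_{n-1}} \le \frac{n}{n-2}$. Spelled out for $n \ge 3$, under the assumption that $S_{n-1} > 0$:

```latex
\[
g_n \;=\; \frac{S_n}{n(n-1)}
\;\le\; \frac{1}{n(n-1)} \cdot \frac{n}{n-2}\, S_{n-1}
\;=\; \frac{S_{n-1}}{(n-1)(n-2)}
\;=\; g_{n-1}.
\]
```

For the two instantiations defined earlier this gives $g_n = \frac{1}{2}$ for AVGWEIGHT (constant in $n$) and $g_n = \frac{1}{n-1}$ for AVGDEGREE (strictly decreasing).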
Unless explicitly stated, we will focus on the time instant where the weight of the edge between vertices $a$ and $b$ is updated from $w_{ab} = w$ to $w + \delta$. Whenever a quantity $X$ can be affected by this update, we will denote its value before the update as $X^-$ and its value after the update as $X^+$. We omit this superscript when it does not affect results in any way. For example, $w_{ab}^- = w$, $w_{ab}^+ = w + \delta$.
3. THE DYNDENS APPROACH
Let us now discuss how our proposed algorithm, DYNDENS, identifies, at every point of time, all output-dense subgraphs.
Dense subgraphs and growth property: Observe that there is an inherent tradeoff in the set of subgraphs that DYNDENS will maintain, which we term “dense” subgraphs. At one extreme, DYNDENS could opt to maintain only output-dense subgraphs, with the other extreme being to maintain all subgraphs. However, neither of these is desirable: the former because it does not enable incremental computation of output-dense subgraphs, the latter due to its prohibitive costs. We will subsequently (Section 4.2) formally quantify this tradeoff. For now, loosely speaking, we will say that $C$ is a dense subgraph iff it has density greater than a given threshold $T_{|C|}$ (which is a function of the cardinality of $C$), and cardinality of at most $N_{max}$ (for a complete list of density-related terms used in this work cf. Table 1). $T_n$ is defined in a manner that ensures that every dense graph with $n$ vertices has at least one dense subgraph with $n-1$ vertices (thus it is possible to identify all dense subgraphs by “growing” dense subgraphs of smaller cardinalities). Specifically, $T_n$ is a monotonically increasing function of $n$ with the property $T_n \cdot g_n > T_{n-1} \cdot g_{n-1}$. At a high level, this monotonicity property ensures the desired containment property mentioned earlier (see Section 4 for details⁴). Moreover, we require that $T_{N_{max}} = T$.⁵ We discuss the concrete instantiation of $T_n$ used by DYNDENS in Section 4.2.

⁴ Another way to view dense graphs is the following: Consider the measure $normDens(C) = \frac{dens(C)}{T_{|C|}}$, consisting of a density measure, normalized by the threshold function $T_n$; a graph $C$ is dense iff it has $normDens(C) \ge 1$. While $normDens(C)$ is not a suitable measure of density per se, it has the following important growth property: every graph $C$ has a subgraph $C'$ of cardinality $|C'| = |C| - 1$ with $normDens(C') \ge normDens(C)$. This containment/growth property additionally implies that, if there are no dense subgraphs of cardinality $n$, there can be no dense subgraphs of any cardinality $> n$.
⁵ Recall that $T_n$ is an increasing function of $n$, and the set of maintained subgraphs needs to include all output-dense subgraphs of cardinality $\le N_{max}$ having density $\ge T$.

Table 1: Definitions of density-related properties
Subgraph C is · · ·    iff

Table 2: Summary of main symbols used
Symbol          Description
$V$             Set of vertices in graph
$N$             Number of vertices in graph
$w_{ij}$        Weight of edge between vertices $i$ and $j$
$\vec{\Gamma}_u$   Neighborhood vector of vertex $u$
$dens(C)$       Density of $C$: $dens(C) = \frac{\sum_{i,j \in C \wedge i<j} w_{ij}}{S_{|C|}}$
$S_n$           Quantifies relative importance of subgraph cardinality $n$ to density
$g_n$           Normalized version of $S_n$: $g_n = \frac{S_n}{n \cdot (n-1)}$
AVGWEIGHT       Case where $S_n = n(n-1)/2$
SQRTDENS        Case where $S_n = \sqrt{n(n-1)}$
AVGDEGREE       Case where $S_n = n$
$N_{max}$       Max. cardinality of subgraph to be returned
$T$             Min. density for a subgraph to be returned
$T_n$           Min. density for subgraph of cardinality $n$ to be dense
$\delta_{it}$   Tunable parameter of DYNDENS, influences $T_n$
$a, b$          Vertices that were just updated
$x^-$           Quantity $x$ before the update
$x^+$           Quantity $x$ after the update
$w$             Weight of edge $(a, b)$ before the update, i.e. $w_{ab}^-$
$w + \delta$    Weight of edge $(a, b)$ after the update, i.e. $w_{ab}^+$

Edge weight updates: The basic operation of DYNDENS is to maintain dense subgraphs, following the update of the weight of an edge $(a, b)$, from $w$ to $w + \delta$. If this impacts the set of output-dense subgraphs, the latter is updated as well. Handling updates with $\delta < 0$ (i.e. where the weight of an edge decreases) is straightforward: all dense subgraphs containing both $a$ and $b$ are examined, and their density is decreased by an appropriate amount. If they are no longer output-dense, this is reported; if, in addition, they are no longer dense (losing-dense), they are evicted from the index.

Positive updates: Of greater interest is the case where $\delta > 0$, i.e. the edge weight update corresponds to an increase in weight. In this case, additional subgraphs, that were not dense prior to the update, might now be dense (newly-dense subgraphs). DYNDENS leverages the growth property to compute these as follows:
Algorithm 1 Algorithm DYNDENS
Input: Updated edge (a, b), magnitude of update δ
 1: if δ < 0 then
 2:     Update the density of all dense subgraphs containing a and b; evict losing-dense subgraphs from the index; report any subgraphs that are no longer output-dense
 3:     return
 4: for all dense subgraphs C s.t. a ∈ C ∨ b ∈ C do    {// including C = {a, b} if it is newly-dense}
 5:     if a ∉ C or b ∉ C then
 6:         if C should be cheap-explored and C ∪ {a, b} is newly-dense then
 7:             Add C ∪ {a, b} to the index, report it if it is output-dense
 8:             explore(C ∪ {a, b}, 2)
 9:     else
10:         Update the density of C, report it if it just became output-dense
11:         explore(C, 1)
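The negative-update branch (lines 1-3 of Algorithm 1) is simple enough to sketch in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the index is modelled as a plain dictionary from vertex sets to scores, it is scanned linearly rather than through an index keyed by vertex, a subgraph is treated as dense when its density is at least $T_n$, and the thresholds are passed in as a precomputed table. All names are hypothetical.

```python
def apply_negative_update(index, a, b, delta, S, T_n, T):
    """Handle an update (a, b, delta) with delta < 0.

    index maps frozenset(C) -> score(C) for every maintained dense subgraph.
    S(n) is the size normaliser S_n, T_n maps cardinality n -> dense threshold,
    and T is the output threshold. Returns the subgraphs that stopped being
    output-dense; subgraphs that are no longer dense are evicted from index.
    """
    if delta >= 0:
        raise ValueError("this sketch only covers weight decreases")
    no_longer_output_dense = []
    for C in [C for C in index if a in C and b in C]:
        n = len(C)
        old_density = index[C] / S(n)
        index[C] += delta                  # only edge (a, b) changed
        new_density = index[C] / S(n)
        if old_density >= T and new_density < T:
            no_longer_output_dense.append(C)
        if new_density < T_n[n]:           # losing-dense: evict
            del index[C]
    return no_longer_output_dense

# Hypothetical index contents (scores, AvgWeight) and thresholds.
idx = {
    frozenset({1, 2}): 0.95,
    frozenset({1, 3}): 1.0,
    frozenset({1, 2, 3}): 3.05,
}
reported = apply_negative_update(
    idx, 1, 2, -0.15,
    S=lambda n: n * (n - 1) / 2,
    T_n={2: 0.9, 3: 0.975, 4: 1.0},
    T=1.0,
)
```

Because only the weight of edge $(a, b)$ changes, the score of every affected subgraph changes by exactly $\delta$, so no edge weights need to be re-read.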
Cheap explore: DYNDENS will try to augment all dense subgraphs containing either $a$ or $b$, with $b$ or $a$, respectively; resulting newly-dense subgraphs will be inserted into the dense subgraph index. In some cases, this step alone is sufficient and/or can be applied only to a subset of these subgraphs (cf. Section 6 for details).
Explore: DYNDENS will try to augment dense subgraphs containing both $a$ and $b$, with one neighboring vertex; resulting newly-dense subgraphs will be inserted into the dense subgraph index.
Exploration iterations: The above procedure may need to be performed iteratively for newly-dense subgraphs discovered via exploration or cheap exploration. Interestingly, the iteration depth is upper bounded by a corollary of the growth property. Specifically, in Section 4.2, we define $T_n$ parametrized by a parameter $\delta_{it}$ that indirectly controls the number of dense subgraphs maintained by DYNDENS. As we show in Section 4, we can guarantee that at most $\lceil \frac{\delta}{\delta_{it}} \rceil$ exploration iterations need to be performed, in order to identify all newly-dense subgraphs, following an edge weight update of magnitude $\delta$.
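The bound itself is a one-line computation. The helper below is a hypothetical illustration (the name `max_exploration_iterations` is not from the paper); it uses plain floating-point division, so in the rare case where rounding noise pushes the quotient just above an integer, the result is rounded up by one, which at worst should cost an extra exploration round rather than miss a subgraph.

```python
import math

def max_exploration_iterations(delta, delta_it):
    """Upper bound ceil(delta / delta_it) on the recursive exploration depth."""
    if delta <= 0 or delta_it <= 0:
        raise ValueError("expects a positive update and a positive delta_it")
    return math.ceil(delta / delta_it)

single = max_exploration_iterations(0.15, 0.15)  # update no larger than delta_it
double = max_exploration_iterations(0.30, 0.15)  # update twice as large
```

A larger $\delta_{it}$ therefore shortens exploration per update; the text notes that it also indirectly controls how many dense subgraphs are maintained.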
Explore all: In a few cases, the above exploration may need to be performed on non-neighboring nodes as well, resulting in a very costly procedure. In most cases, DYNDENS avoids performing this procedure via a better, implicit representation of some dense subgraphs in the index (cf. Section 3.2.3).
In one sentence, DYNDENS explores the neighborhood of some materialized dense subgraphs, using pruning conditions to decide when to stop exploring around a subgraph. The remainder of this section fills in the blanks in the preceding sentence. We discuss the workings of DYNDENS, and illustrate them with a practical example in Section 3.1, followed by important technical details in Section 3.2. We defer the exposition of the theoretical results on which DYNDENS is based to Section 4.
3.1 The DYNDENS Algorithm
Let us now discuss DYNDENS in greater detail, with reference to Algorithm 1. At a high level, DYNDENS maintains an in-memory index of all dense subgraphs (we defer discussing index implementation details to Section 3.2); at every edge weight update, it outputs information regarding subgraphs that became, or stopped being, output-dense. If the edge weight update was negative, only some index maintenance needs to be done (line 2). Otherwise, some stable-dense subgraphs containing $a$ and/or $b$ are further examined (lines 4-11). Note that, to ensure correctness, the subgraph $\{a, b\}$ may also be examined, even if it was not present in the index (base case in line 4). Subgraphs in the index containing only one of $a$, $b$ are cheap-explored, if needed⁶ (line 6).

Subgraphs in the index that contain both $a$ and $b$, as well as newly-dense subgraphs previously identified, are subsequently explored (line 11), i.e. DYNDENS will try to augment them with a neighboring node (we defer discussing the precise details on how this is done efficiently to Section 3.2). This will be recursively repeated on any newly-dense subgraphs discovered, up to $\lceil \frac{\delta}{\delta_{it}} \rceil$ times (the theoretical results that enable this bounding are discussed in Section 4). A high-level description of the exploration procedure is shown in Algorithm 2.

Algorithm 2 Procedure explore(C, i)
Input: Subgraph C. Iteration number i
 1: if C was not too-dense before the update and i ≤ ⌈δ/δit⌉ and |C| < Nmax then
 2:     if C is too-dense then
 3:         for all y ∉ C do    {// Explore-All}
 4:             Add C ∪ {y} to the index; report it if it is output-dense
 5:             explore(C ∪ {y}, i + 1)
 6:     else
 7:         for all neighbors y of C do
 8:             if C ∪ {y} is newly-dense then
 9:                 Add C ∪ {y} to the index; report it if it is output-dense
10:                 explore(C ∪ {y}, i + 1)
Algorithm 2 will first ensure that the subgraph should be explored. Specifically, the subgraph should not have been too-dense before the update (line 1), for otherwise its dense supergraphs would have been stable-dense, and hence already identified. Moreover, as previously mentioned, DYNDENS will not explore around any subgraph more times than necessary. Finally, in a few cases, explored subgraphs will need to be augmented with every other vertex, not just neighboring ones (Explore-All; line 2). As the latter is a costly procedure, in Section 3.2.3 we will present a way to mitigate the associated cost.
Execution example. To illustrate the workings of DYNDENS, let us examine a simple example of its execution. Consider the sample entity graph of Figure 2(a), and assume an AVGWEIGHT definition of density (i.e. the density of a subgraph is its average edge weight), a density threshold of $T = 1$, and a maximum desired subgraph cardinality of $N_{max} = 4$. Assume that $\delta_{it}$ has been set to 0.15, so that the thresholds $T_n$ for subgraphs of cardinality $n$ to be considered dense are $T_2 = 0.9$, $T_3 = 0.975$ and $T_4 = T = 1$ (cf. Section 4.2 for details). Thus, the dense subgraphs for this graph are shown in Figure 2(b) (output-dense subgraphs are emphasized). Finally, assume that the weight of edge (1, 2) is updated from 0.8 to 0.95 ($\delta = \delta_{it} = 0.15$). Let us examine how DYNDENS will handle this update; to facilitate this discourse, the newly-dense subgraphs that are inserted into the index are shown in the bottom half of Figure 2(b).

Figure 2: Execution example
(a) Entity graph
(b) Dense subgraph index

Subgraph        Density   Output-dense?
Dense, before update
{1, 3}          1.0       Y
{1, 4}          1.0       Y
{2, 3}          1.1       Y
{2, 4}          1.0       Y
{3, 4}          1.0       Y
{1, 3, 4}       1.0       Y
{2, 3, 4}       1.03      Y
Newly-dense, after update
{1, 2}          0.95      N
{1, 2, 3}       1.016     Y
{1, 2, 4}       0.983     N
{1, 2, 3, 4}    1.0083    Y

At a high level, DYNDENS will examine {1, 2}, as well as all dense subgraphs containing vertex 1 or vertex 2, i.e. {1, 3}, {1, 4}, {2, 3}, {2, 4}, {1, 3, 4}, {2, 3, 4}. {1, 2} will be added to the index (Algorithm 1, line 10), and will be explored (line 11). Its exploration will entail the addition of newly-dense subgraphs {1, 2, 3} and {1, 2, 4} to the index (Algorithm 2, line 8); the former will also be reported as output-dense. Since $\frac{\delta}{\delta_{it}} = 1$, these newly-dense subgraphs will not be further explored (Algorithm 2, line 10 and line 1). Moreover, during this exploration subgraph {1, 2, 5} will be examined, but as its density is less than $T_3$, it will not be added to the index.

⁶ For instance, subgraphs that were too-dense need not be explored, as, by definition, their dense supergraphs would have been stable-dense, and hence already identified. Moreover, this step can also be skipped in other circumstances, cf. Section 6 for details.
DYNDENS will also cheap-explore subgraphs {1, 3}, {1, 4}, {2, 3}, {2, 4} (Algorithm 1, line 6). This will result in subgraphs {1, 2, 3}, {1, 2, 4} being examined (twice) (Algorithm 1, line 7); as they are already present in the index, this will not affect anything. Moreover, DYNDENS will attempt to explore these subgraphs (Algorithm 1, line 8); however, since $\frac{\delta}{\delta_{it}} = 1$, they will not be explored (Algorithm 2, line 1).
Finally, DYNDENS will cheap-explore subgraphs {1, 3, 4} and {2, 3, 4}. The first cheap exploration will result in newly-dense subgraph {1, 2, 3, 4} being added to the index, and reported as output-dense (Algorithm 1, line 7); the second one will revisit this subgraph, and do nothing. Moreover, in both cases, since $|\{1, 2, 3, 4\}| = 4 \ge N_{max}$, these subgraphs will not be explored (Algorithm 2, line 1).
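The numbers in this example can be checked with a short script. The sketch below does not run DYNDENS; it simply recomputes the set of dense subgraphs on vertices 1 to 4 by brute force before and after the update, which is enough to confirm which subgraphs are newly-dense. The pairwise weights are inferred from the 2-vertex rows of Figure 2(b) (under AVGWEIGHT, the density of a pair equals its edge weight), the thresholds are those quoted above, and edges involving vertex 5 are omitted because their weights cannot be recovered from the text.

```python
from itertools import combinations

# Pairwise weights inferred from the 2-vertex rows of Figure 2(b) (AvgWeight).
# Edges touching vertex 5 are not recoverable from the text and are omitted.
weights = {
    frozenset({1, 2}): 0.8,
    frozenset({1, 3}): 1.0,
    frozenset({1, 4}): 1.0,
    frozenset({2, 3}): 1.1,
    frozenset({2, 4}): 1.0,
    frozenset({3, 4}): 1.0,
}
T_N = {2: 0.9, 3: 0.975, 4: 1.0}   # thresholds T_n quoted in the example
T = 1.0                            # output threshold

def avg_weight(C):
    """AvgWeight density: mean weight over all vertex pairs of C."""
    pairs = list(combinations(sorted(C), 2))
    return sum(weights[frozenset(p)] for p in pairs) / len(pairs)

def dense_sets():
    """All subsets of {1,2,3,4} with 2..4 vertices whose density reaches T_n."""
    return {
        frozenset(C)
        for n in (2, 3, 4)
        for C in combinations((1, 2, 3, 4), n)
        if avg_weight(C) >= T_N[n]
    }

before = dense_sets()
weights[frozenset({1, 2})] = 0.95          # the update (1, 2, +0.15)
after = dense_sets()
newly_dense = after - before
for C in sorted(newly_dense, key=lambda s: (len(s), sorted(s))):
    print(sorted(C), round(avg_weight(C), 4), avg_weight(C) >= T)
```

Under these assumptions the seven subgraphs found before the update and the four newly-dense ones agree with the two halves of Figure 2(b), up to rounding of the densities.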
Observation: From the simplified execution example presented above, one can observe that DYNDENS (as currently presented) can end up performing redundant computations; e.g. some subgraphs are examined unnecessarily many times. Subsequently, in Section 3.2.2 and Section 6, we discuss how to reduce such unnecessary computations.
3.2 Implementation Considerations
Having presented DYNDENS at a high level, let us now see some important considerations that arise when implementing it in practice. We first introduce the underlying indexing structure used by DYNDENS in Section 3.2.1; this index also enables DYNDENS to avoid redundant computations (Section 3.2.2) as well as the costly operation of explore-all (Algorithm 2, line 2; cf. Section 3.2.3).
3.2.1 Index
DYNDENS requires an efficient index for both the evolving graph
itself, as well as for dense subgraphs. For the graph index,
maintaining node adjacency lists is sufficient (i.e. a mapping
$\forall u \in V: u \to \vec{\Gamma}_u$); this also enables the efficient
exploration of a subgraph (via merging the relevant adjacency lists7).
Figure 3: Dense subgraph index
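As a minimal sketch of how exploration can merge adjacency lists, assume the graph index is a dictionary mapping each vertex to a dictionary of neighbor weights. The names `adj` and `neighbor_scores` are illustrative and do not come from the original Java implementation. Summing over the adjacency lists of a subgraph's members gives, for every outside vertex u, the total edge weight that u would contribute if added to the subgraph.

```python
from collections import defaultdict


def neighbor_scores(adj, subgraph):
    """Merge the adjacency lists of `subgraph`'s members.

    Returns a dict mapping each vertex u outside the subgraph to the
    total weight of the edges between u and the subgraph's members.
    `adj` is a hypothetical index: vertex -> {neighbor: edge weight}.
    """
    members = set(subgraph)
    gain = defaultdict(float)
    for v in members:
        for u, weight in adj.get(v, {}).items():
            if u not in members:
                gain[u] += weight
    return dict(gain)


adj = {
    1: {2: 1.0, 3: 0.5},
    2: {1: 1.0, 3: 0.25},
    3: {1: 0.5, 2: 0.25},
}
candidates = neighbor_scores(adj, [1, 2])
```

Only vertices adjacent to at least one member appear in the result, which is what makes this cheaper than testing every vertex in the graph.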
The dense subgraph index is more interesting to examine, as it
needs to efficiently support several functionalities. To name a few:
for every dense subgraph, access to its vertices, cardinality and den-
sity; insertion, update and deletion of dense subgraphs from the in-
dex; iteration over all dense subgraphs containing vertices a or b,
where each subgraph must be accessed exactly one time (needed for
positive edge weight updates); and for a given dense subgraph C,
and a given vertex u, access to subgraph C ∪ {u}, and insertion of
C∪{u} into the index if it is not already present (needed for explo-
ration). Moreover, as DYNDENS needs to perform frequent random
accesses on dense subgraphs, the index needs to be in-memory, so
maintaining a low memory footprint is important. As most dense
subgraphs will tend to have high overlap, the dense subgraph index
should minimize the amount of redundant information stored.
To address these requirements posed by DYNDENS, we pro-
pose the following in-memory index. Each subgraph has a unique
id corresponding to its location in memory; it is also represented
by its (sorted) set of vertices. DYNDENS will maintain a pre-
fix tree of dense subgraphs, illustrated in Figure 3. Each node in
the prefix tree contains pointers to its children, indexed by ver-
tex id, a pointer to its parent, as well as information (such as car-
dinality and density) on the dense subgraph it represents, if ap-
plicable. Figure 3 shows a view of the index when subgraphs
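The prefix tree described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, only insertion and lookup are shown, and deletion and iteration over all subgraphs containing a given vertex are omitted.

```python
class PrefixNode:
    """One node of the prefix tree; the path from the root spells out
    a sorted vertex set."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}   # vertex id -> PrefixNode
        self.density = None  # set only if this node is a dense subgraph


class DenseSubgraphIndex:
    def __init__(self):
        self.root = PrefixNode()

    def _walk(self, vertices, create):
        node = self.root
        for v in sorted(vertices):
            child = node.children.get(v)
            if child is None:
                if not create:
                    return None
                child = PrefixNode(parent=node)
                node.children[v] = child
            node = child
        return node

    def insert(self, vertices, density):
        """Insert the subgraph, or update its density if present."""
        self._walk(vertices, create=True).density = density

    def lookup(self, vertices):
        """Return the stored density, or None if not indexed as dense."""
        node = self._walk(vertices, create=False)
        return None if node is None else node.density


index = DenseSubgraphIndex()
index.insert({3, 1, 2}, 1.016)
index.insert({1, 2, 3, 4}, 1.0083)
```

Because {1, 2, 3} and {1, 2, 3, 4} share the path 1, 2, 3, the common prefix is stored once, which is how the tree reduces redundancy among overlapping subgraphs.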
the instantiation of Tn that will be used throughout this work is:
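$$T_n = \frac{1}{g_n}\left(g_{N_{max}} \cdot T + \delta_{it} \cdot \left(\frac{n-2}{n-1} - \frac{N_{max}-2}{N_{max}-1}\right)\right) \qquad (2)$$
where δit is a tunable parameter. Note that this is a reasonable
value for Tn from a maintenance perspective; for instance, if
$S_n = n$, then $T_n = (n-1)T_2 + (n-2)\delta_{it} = \frac{n-1}{N_{max}-1}(T + \delta_{it}) - \delta_{it} = O(n)$,
while if $S_n = n(n-1)$, then $T_n = T_2 + \left(1 - \frac{1}{n-1}\right)\delta_{it} = T - \delta_{it}\left(\frac{1}{n-1} - \frac{1}{N_{max}-1}\right) = O(1)$.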
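The instantiation of Tn in Equation 2 can be checked numerically with a short sketch. It assumes g_n = S_n / (n(n − 1)); this definition is not restated in this section, but it is consistent with both closed forms given above for S_n = n and S_n = n(n − 1). The function name and argument names are illustrative.

```python
def threshold(n, n_max, T, delta_it, S):
    """Cardinality-specific threshold T_n from Equation 2.

    Assumption (not restated in this section): g_n = S(n) / (n * (n - 1)).
    `S` is a function giving the normalization S_n for cardinality n >= 2.
    """
    def g(k):
        return S(k) / (k * (k - 1))

    slack = (n - 2) / (n - 1) - (n_max - 2) / (n_max - 1)
    return (g(n_max) * T + delta_it * slack) / g(n)


# The two instances discussed in the text, for n = 3 and Nmax = 5.
t_linear = threshold(3, 5, T=1.0, delta_it=0.1, S=lambda k: k)
t_quadratic = threshold(3, 5, T=1.0, delta_it=0.1, S=lambda k: k * (k - 1))
```

For S_n = n the closed form (n − 1)/(Nmax − 1) · (T + δit) − δit gives 0.45 here, and for S_n = n(n − 1) the closed form T − δit(1/(n − 1) − 1/(Nmax − 1)) gives 0.975; both agree with the values computed by the sketch up to floating-point rounding.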
Importantly, this instantiation results in a much simplified form
of Equation 1, specifically δ < δit. In the following, we will lever-
age this fact, to obtain a bound on the number of exploration itera-
tions that DYNDENS needs to perform.
Moreover, for our proposed techniques to be meaningful, it must
be the case that $T_n > 0$ $\forall n \in \{2, \cdots, N_{max}\}$. This, along with
the above simplified form of Equation 1, leads to the following
validity range for δit: $\delta_{it} \in \left(0, \frac{S_{N_{max}} T}{N_{max}(N_{max}-2)}\right)$. The lower bound
would correspond to maintaining the smallest possible number of
subgraphs, and the upper bound to maintaining most subgraphs
(specifically, all subgraphs of cardinality Nmax, and most
subgraphs of lower cardinalities); realistically speaking, one should
not set δit to any value close to its upper bound.
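As a short check of where the upper end of this range comes from, the derivation below applies the positivity requirement on Tn to Equation 2. It assumes $g_n = S_n/(n(n-1))$, which is consistent with the two worked instances of $T_n$ given for $S_n = n$ and $S_n = n(n-1)$ but is not restated in this section, and it assumes $N_{max} > 2$.

```latex
\begin{align*}
T_n > 0 \;&\Longleftrightarrow\;
  g_{N_{max}} T + \delta_{it}\left(\frac{n-2}{n-1} - \frac{N_{max}-2}{N_{max}-1}\right) > 0
  && \text{(since } g_n > 0\text{)} \\
\text{tightest at } n = 2:\quad
  & g_{N_{max}} T - \delta_{it}\,\frac{N_{max}-2}{N_{max}-1} > 0
  && \text{(since } \tfrac{n-2}{n-1} \text{ increases with } n\text{)} \\
\Longleftrightarrow\quad
  & \delta_{it} < g_{N_{max}} T \cdot \frac{N_{max}-1}{N_{max}-2}
  = \frac{S_{N_{max}}\, T}{N_{max}(N_{max}-2)}
\end{align*}
```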
4.3 Bounding the Number of Iterations
We are now able to extend Equation 1 to cases where δ > δit.
Specifically, we will show that all newly-dense subgraphs of
cardinality n are contained in $C_0 \cup C_1 \cup \cdots \cup C_{\lceil \delta/\delta_{it} \rceil}$;
thus, in order to compute all newly-dense subgraphs, it is sufficient
to explore around stable-dense and newly-dense subgraphs contained in
$C_0 \cup C_1 \cup \cdots \cup C_{\lceil \delta/\delta_{it} \rceil}$.
Proof sketch: An update of magnitude δ is equivalent to
$\lceil \delta/\delta_{it} \rceil$ updates of magnitude up to δit; furthermore,
re-exploring stable-dense subgraphs will not yield any new dense
subgraphs, thus only newly-dense subgraphs will need to be explored
subsequently.
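The bound in the proof sketch can be illustrated with a few lines of code. The helper names below are hypothetical; the sketch only computes the iteration bound and one possible decomposition of an update into pieces of magnitude at most δit, and does not perform any exploration.

```python
import math


def exploration_iterations(delta, delta_it):
    """Upper bound on exploration iterations: ceil(delta / delta_it)."""
    if delta <= 0 or delta_it <= 0:
        raise ValueError("delta and delta_it must be positive")
    return math.ceil(delta / delta_it)


def split_update(delta, delta_it):
    """Split one update of magnitude delta into equal pieces, each of
    magnitude at most delta_it (one of many valid decompositions)."""
    k = exploration_iterations(delta, delta_it)
    return [delta / k] * k


pieces = split_update(0.25, 0.1)
```

An update no larger than δit yields a single piece, matching the case in the execution example where newly-dense subgraphs are not explored further.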
Discussion: As witnessed from the above result, the magnitude
of δ is directly correlated with the impact on dense subgraphs. A
useful analogy is that of an edge weight update as a perturbation:
the greater its magnitude δ, the further away in the graph its effects
can be potentially felt (i.e. the further away dense subgraphs will
need to be explored).
In this context, parameter δit offers a tunable space-time trade-
off. By setting it to higher values, more dense subgraphs will be
maintained, but fewer exploration iterations will be required per
edge update. By setting it to lower values, the space overhead
(i.e. the number of dense subgraphs maintained that are not
output-dense) can be made minimal: nearly 0 for AVGWEIGHT, and
comparable to an offline approach otherwise8. Consequently, selecting
a good value for δit is data-dependent; in practice, we observe
that DYNDENS performs well for a wide range of δit values.
5. EVALUATION
Let us now discuss the experimental validation of our techniques.
We will first briefly go over the experimental setup. In Section 5.1
we will present experimental evidence for the feasibility of real-
time story identification via ENGAGEMENT, as well as the scala-
bility of our proposed approach. We will also examine the main
factors that contribute to the efficiency of DYNDENS.
As we have seen throughout this work, there is a lack of existing
techniques for efficiently addressing ENGAGEMENT. Nevertheless,
in Section 5.2 we evaluate adaptations of relevant techniques to this
problem, so as to have a basis for comparison.
Finally, although efficiency has been our main focus in this work,
in Section 5.3 we present some qualitative results that highlight the
effectiveness of our approach.
Experimental setup: All algorithms evaluated were implemented
in Java, and executed on a 64-bit HotSpot VM, on a machine
with 8 Intel(R) Xeon(R) E5540 CPU cores clocked at 2.53GHz. In
our experiments, only one core was used, and the memory usage of
the JVM was capped at 25 GB of RAM (the actual memory consumption
was typically lower). Finally, in all performance experiments,
the time reported is the median time of 3 identical runs.
Datasets: Unless otherwise noted, all our experiments were run
using real-world datasets, based on a sample of all tweets for May
1st, 2011. Our dataset consisted of 13.8M tweets; the sampling
was performed by Twitter itself, as part of the restricted access
provided to its data stream (for details cf. tinyurl.com/twsam).
From these, we removed non-English tweets, and tweets that were
labeled as spam (using an in-house tweet spam filter [24]), resulting
in 3.8M tweets. Subsequently, we used an in-house entity extractor
[3] to identify mentions of real-world entities (such as people,
politicians, products, etc.). 76.5% of the tweets did not mention any
entity of interest; 18.3% mentioned one; 4.3% mentioned two, and
under 1% mentioned three or more entities. The entire procedure
took under 1 h 20 min (under 350 µsec per tweet on average).
Measuring correlation: Given these sets of co-occurring entities,
there are many ways in which entity association can be measured;
our techniques are equally applicable, irrespective of the
measure used. For our evaluation, we selected two measures from
the literature that we found to yield meaningful results under
diverse circumstances: a combination of the χ² measure and the
correlation coefficient inspired by [5] (weighted dataset), that has
been found to be highly effective in identifying stories in the
blogosphere, as well as a thresholded variant of the log-likelihood
ratio [26] (unweighted dataset) that has been successfully used to
identify stories in Grapevine over an extended period of time. In
general, we note that any measure of the strength of pairwise
association, based on entity occurrences and pairwise co-occurrences,
can equally be used by our techniques.
8 All exact offline approaches, to the best of our knowledge, utilize
some form of a growth property, hence need to compute as many
subgraphs as DYNDENS with δit ≃ 0.
Identifying emerging stories: Since the goal of our techniques
is to identify stories in real-time, i.e. “stories happening now”, a
mechanism for discounting older stories is required. To achieve
this, we modify our measures of correlation, by applying exponen-
tial decay to all entity occurrences and co-occurrences; for instance,
in our experiments we used a mean life for a tweet of 2 hours.
Note that our techniques are equally applicable without applying
any decay, but the stories identified would then correspond to “cu-
mulative stories to date” (cf. Table 3 showing stories for the entire
day) as opposed to “current emerging stories” (cf. online demo
www.onthegrapevine.ca/now.jsp).
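A minimal sketch of an exponentially decayed occurrence counter follows. It assumes that "mean life" denotes the time constant τ in the decay factor exp(−Δt/τ), which is the usual meaning of mean lifetime; the class name and the lazy-update design are illustrative choices, not details taken from the original system.

```python
import math

MEAN_LIFE_SECONDS = 2 * 60 * 60  # mean tweet life of two hours


class DecayedCounter:
    """Occurrence (or co-occurrence) count with exponential decay,
    updated lazily: decay is applied only when the counter is touched."""

    def __init__(self, mean_life=MEAN_LIFE_SECONDS):
        self.mean_life = mean_life
        self.value = 0.0
        self.last_update = None

    def value_at(self, now):
        """Decayed count at time `now` (seconds)."""
        if self.last_update is None:
            return 0.0
        elapsed = now - self.last_update
        return self.value * math.exp(-elapsed / self.mean_life)

    def add(self, now, amount=1.0):
        """Record a new occurrence at time `now`."""
        self.value = self.value_at(now) + amount
        self.last_update = now


counter = DecayedCounter()
counter.add(now=0)
after_two_hours = counter.value_at(7200)
```

With this convention, a single mention contributes 1/e of its original weight after two hours.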
Approximating complex association measures: Finally, for
many measures of association (e.g. statistical measures, such as
the log-likelihood ratio), the appearance of a document with just
a single entity, can influence the weight of all edges in the graph
(e.g. the log-likelihood ratio of a pair of entities is a function of the
number of documents that have appeared to date). This would pose
a significant challenge to incremental computations; to overcome
it, we make use of the following approximation, that is applicable
to any measure: the weight of an edge connecting entities e1, e2 is
computed by ignoring all documents that have appeared after the
latest time that either e1 or e2 appeared in some document.
Intuitively, this will not significantly affect edges connecting pop-
ular entities; indeed we observed that in practice the resulting drop
in precision entailed by this approximation was fairly low9.
Importantly, this approximation means that, after observing a document
that mentions entities e1, · · · , ej, only the weights of edges that
are incident to at least one of these entities need to be updated,
i.e. only the weights of edges {(ei, X) | i ∈ {1, · · · , j}, X ∈ V}
will be updated.
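The set of edges touched by one document can be sketched as below. The sketch makes an additional assumption that is not stated in the text: among all pairs (ei, X), only edges already present in the adjacency index, plus pairs of entities co-mentioned in the document, are materialized. The function and variable names are hypothetical.

```python
def edges_to_update(doc_entities, adj):
    """Edges whose weight is recomputed after observing a document.

    `adj` is a hypothetical index: entity -> {neighbor: edge weight}.
    Edges are returned as frozensets, since the graph is undirected.
    """
    mentioned = set(doc_entities)
    edges = set()
    for e in mentioned:
        # Existing edges incident to a mentioned entity.
        for x in adj.get(e, {}):
            edges.add(frozenset((e, x)))
        # Pairs co-mentioned in this document (possibly new edges).
        for other in mentioned:
            if other != e:
                edges.add(frozenset((e, other)))
    return edges


adj = {
    "a": {"b": 0.5},
    "b": {"a": 0.5, "c": 0.2},
    "c": {"b": 0.2, "d": 0.9},
    "d": {"c": 0.9, "e": 0.1},
    "e": {"d": 0.1},
}
touched = edges_to_update(["a", "c"], adj)
```

In this example the edge between d and e is left untouched, because neither endpoint is mentioned in the document.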
Taking the above into account, the precise manner in which our
experimental datasets were created is as follows.
For every tweet where at least one entity was identified, en-
tity occurrences and co-occurrences were updated (taking expo-
nential decay into account, with a mean tweet life of two hours).
Thereafter, in the case of the weighted dataset, the χ² and
correlation coefficient of salient entity pairs were updated; the
updated edge weight was computed as max(correlation coefficient, 0)
if χ² showed significant correlation (p < 5%), and 0 otherwise.
This procedure resulted in 952K positive and 40.5M negative edge
weight updates (recall that the latter are very cheap to process).
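The weighted edge weight rule can be sketched for a single entity pair as follows. This is an illustration under stated assumptions: the correlation coefficient is taken to be the phi coefficient of the 2×2 co-occurrence table (for which χ² = N·φ²), the test uses the standard critical value 3.841 for one degree of freedom at the 5% level, and the exact combination used in [5] is not reproduced. Counts may be decayed (non-integer) values.

```python
import math

CHI2_CRITICAL_5PCT_1DOF = 3.841  # chi-square critical value, 1 d.o.f., p = 0.05


def weighted_edge_weight(n11, n10, n01, n00):
    """Edge weight for an entity pair from its 2x2 co-occurrence table.

    n11: documents mentioning both entities, n10 / n01: only the first /
    only the second, n00: neither. Returns max(phi, 0) if chi-square is
    significant at the 5% level, and 0 otherwise.
    """
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    denom = row1 * row0 * col1 * col0
    if denom == 0:
        return 0.0  # degenerate table: no evidence of association
    phi = (n11 * n00 - n10 * n01) / math.sqrt(denom)
    chi2 = n * phi * phi
    return max(phi, 0.0) if chi2 > CHI2_CRITICAL_5PCT_1DOF else 0.0
```

Negatively associated pairs pass the significance test but are clamped to 0 by the max, so they never contribute to density.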
In the case of the unweighted dataset, the log-likelihood ratio
of salient entity pairs was updated. Two entities were connected
with an edge iff each entity appeared in at least 5 tweets, and log-
likelihood showed significant correlation (p < 1%). This proce-
dure resulted in 43K positive edge weight updates (edge additions),
and 41K negative ones (edge removals).
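The edge rule for the unweighted dataset can be sketched in the same style. The sketch assumes the standard log-likelihood ratio statistic G² = 2 Σ O·ln(O/E) on the 2×2 co-occurrence table and the standard critical value 6.635 for one degree of freedom at the 1% level; the exact thresholded variant of [26] is not specified here, and the direction of the association is not checked because the text does not state it.

```python
import math

G2_CRITICAL_1PCT_1DOF = 6.635  # chi-square critical value, 1 d.o.f., p = 0.01
MIN_OCCURRENCES = 5            # each entity must appear in at least 5 tweets


def has_edge(n11, n10, n01, n00):
    """True iff both entities are frequent enough and the log-likelihood
    ratio of their 2x2 co-occurrence table is significant at 1%."""
    n = n11 + n10 + n01 + n00
    first, second = n11 + n10, n11 + n01
    if first < MIN_OCCURRENCES or second < MIN_OCCURRENCES:
        return False
    cells = [
        (n11, first, second),
        (n10, first, n - second),
        (n01, n - first, second),
        (n00, n - first, n - second),
    ]
    g2 = 0.0
    for observed, row_total, col_total in cells:
        if observed > 0:  # empty cells contribute 0 to the statistic
            g2 += 2.0 * observed * math.log(observed * n / (row_total * col_total))
    return g2 > G2_CRITICAL_1PCT_1DOF
```

An edge addition or removal then corresponds to this predicate changing value for a pair after an update.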
In either case, this step took under 90 seconds for the entire day.
The streams of edge weight updates were loaded to memory be-
fore initiating our experiments, and the updates were provided to
DYNDENS sequentially, and in-memory. This reflects the expected
usage of DYNDENS, as the edge weight updates that constitute its
input will typically be generated by another process in real-time.
9 Specifically, we measured the error entailed by this approximation,
i.e. the absolute difference of the approximated value of each
edge weight, minus the actual value of the correlation measure, for
all edges, at 100 uniformly distributed time instants. The median
error over all edges was invariably 0; the average absolute error
over all edges and all time instants was 0.0003 for the weighted
dataset, and 0.002 for the unweighted one, and the average relative
error was 10% and 6% respectively.
All times reported correspond to the time required to process all
edge weight updates resulting from a dataset, while maintaining
output-dense subgraphs after each update. Specifically, they do not
include the time required to preprocess the dataset (e.g. entity
extraction, correlation computation), nor do they include the fixed
initialization costs of DYNDENS (such as JVM initialization and
initialization of necessary indexing structures). It is worth noting,
however, that the throughput of DYNDENS can more than match
the stream rate, even after factoring in all preprocessing steps (in
total, the overhead for all preprocessing and execution of DYNDENS
for our dataset of one day was generally under 90 minutes;
moreover, the most costly preprocessing steps, i.e. named entity
extraction, are inherently parallelizable).
5.1 Efficiency and Scalability
Let us now examine some of our experimental findings.
Figures 4(a)-4(d) show the time required to process all updates from
either dataset, for a variety of definitions of density (experiments
involving additional density functions can be found in [4]), and for
a wide range of values of density threshold T and maximum dense
subgraph cardinality Nmax. In these figures, δit has been set to
1% of its maximum value, given the values of the other parameters
(thus the number of maintained dense subgraphs is typically close
to the number of output-dense subgraphs). All runs were capped
at 10 minutes (runs that took longer than that were terminated); all
figures are cropped to exclude such time-outs10.
We observe that DYNDENS is able to very efficiently process
large datasets, across a wide range of useful operating parame-
ters, validating its applicability for efficiently addressing ENGAGE-
MENT. The chosen parameters range from instances with none,
or only a few output-dense subgraphs, to instances with too many
output-dense subgraphs (in the thousands); i.e. the extremal param-
eter values correspond to instances of less practical interest. Inter-
estingly, one can observe a sharp increase in performance beyond
certain values of parameters T and Nmax. This is due to the en-
suing sharp drop in the average number of output-dense subgraphs.
For instance, with reference to Figure 4(c), the average11 number
of output-dense subgraphs of cardinality at most 6, for T = 1 is
3.4K; for T = 0.8 it is 13.4K; while for T = 0.7 it is over 52K.
Similar trends can be observed in the other figures as well; cf. [4].
Having discussed the scalability and efficiency of DYNDENS, let
us now turn to evaluating its inner workings. Firstly, let us examine
the effects of the δit parameter. Recall that low values of δit
correspond to DYNDENS materializing fewer dense subgraphs, and,
correspondingly, having to perform potentially more explorations.
In our experiments, we found our techniques to perform equally
well for a wide range of values of δit; however, selecting a value
for it, based on characteristics of the dataset can be beneficial to
performance. In Figure 4(e), we show the time taken by DYN-
DENS to process the unweighted dataset (note the semilog scale),
for Nmax = 10 and AVGWEIGHT, across all possible values for
δit (shown normalized to its maximum value for each threshold).
We observe an interesting local optimum wrt. δit, arising from the
tradeoff of having to materialize more subgraphs, while enabling
faster updates; i.e. increasing δit improves performance, up to a
point where the additional dense subgraphs that need to be main-
tained make this a performance drain. For instance, this point is
around 0.2 for T = 0.8, around 0.1 for T = 0.9, and around 0.6
for T = 1. It is also interesting to note that this tradeoff comes into
play again for T = 1 and high δit.
10 The only data points that had terminated runs are outside the
displayed range; these instances had too large a number of
output-dense subgraphs, as a result of unrealistic values for T, Nmax
and/or δit, and were not expected to finish in a reasonable time.
11 Averaged over all updates, and excluding output-dense subgraphs
that are not represented in the index, e.g. most too-dense subgraphs,
augmented with a non-neighboring node (cf. Section 3.2.3).
Figure 4: Experimental evaluation. Panels: (e) Effects of δit,
unweighted; (f) Recall of GRASP, unweighted; (g) Performance of GRASP
relative to DYNDENS, unweighted; (h) Effect of heuristics, synthetic.
As we previously saw in Section 3.2.3, IMPLICITTOODENSE
is crucially important for DYNDENS to operate efficiently, in the
presence of too-dense subgraphs. We validated this intuition exper-
imentally, by executing a variant of DYNDENS that did not make
use of IMPLICITTOODENSE, on the weighted dataset, and comparing
its runtime to that of DYNDENS. We experimented with execution
parameters Nmax ∈ {9, 10}, T ∈ [0.44, 0.5], and with δit
between 1% and 50% of its maximum value, given the values of the
other parameters. Invariably, the variant without IMPLICITTOODENSE
took longer than 20 minutes to complete (and was killed after
20 minutes, in the interests of brevity), while DYNDENS took
40-85 seconds to complete.
5.2 Comparison with Other Techniques
As we have already discussed throughout this work, to the best of
our knowledge, prior to DYNDENS, no techniques have been pro-
posed for efficiently addressing ENGAGEMENT in its general form.
Thus, in order to have a basis for comparison, in this section we
evaluate adaptations of relevant techniques to subsets of ENGAGE-
MENT, namely the dynamic maximal clique algorithm proposed
in [27] (STIX), the Greedy Randomized Adaptive Search Proce-
dure used to identify large quasi-cliques in [1] (GRASP), as well
as a baseline efficient offline procedure that periodically recom-
putes all AVGWEIGHT dense subgraphs (BASELINE). We wish to
stress that, by their very nature, these comparisons are not fair, as the
goals of the aforementioned techniques are entirely different from
those of ENGAGEMENT, while said techniques are not as general
as DYNDENS.
Let us review each comparison in detail. The STIX algorithm
[27] identifies all maximal cliques in dynamic unweighted graphs.
This is similar to ENGAGEMENT for T = 1, AVGWEIGHT and
unweighted graphs, but subtly different, in that ENGAGEMENT re-
quires the identification of all cliques. Recall that the output of
ENGAGEMENT will be used to present stories to a human user, thus
the subgraphs produced cannot be too large. If STIX were used to
address ENGAGEMENT, and a maximal clique of cardinality e.g.
20 were identified, all its subgraphs of cardinality e.g. 5 or less
would need to be enumerated, and provided as output.
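To give a sense of the enumeration cost in this example, the number of sub-cliques of bounded cardinality inside one maximal clique is a sum of binomial coefficients. The sketch below assumes that subgraphs have at least two vertices; the function name is illustrative.

```python
from math import comb


def subclique_count(clique_size, max_cardinality, min_cardinality=2):
    """Number of vertex subsets of a clique with cardinality between
    min_cardinality and max_cardinality (every such subset is a clique)."""
    upper = min(max_cardinality, clique_size)
    return sum(comb(clique_size, k) for k in range(min_cardinality, upper + 1))


total = subclique_count(20, 5)
```

For a maximal clique of cardinality 20 and subgraphs of cardinality at most 5, this gives 190 + 1,140 + 4,845 + 15,504 = 21,679 subgraphs to enumerate.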
Keeping in mind the caveats above, we implemented STIX us-
ing an efficient in-memory hash-based index12, and executed it on
the unweighted dataset, measuring its execution time, and ignor-
ing the time that would be needed for enumerating all subgraphs of
maximal cliques. We compared this runtime to DYNDENS with
AVGWEIGHT, T = 1 (so as to have a basis for comparison),
Nmax = 5,13 and set δit to half its maximum value, given the
values of the other parameters.
Even though a comparison of STIX and DYNDENS is entirely
artificial, the runtimes of STIX and DYNDENS were roughly equal:
STIX took 958 seconds to process the dataset, compared to 936
sec for DYNDENS. DYNDENS performed even better for lower
Nmax, and took more time for higher Nmax. Thus, we conclude
that DYNDENS is best suited to applications of ENGAGEMENT,
while STIX is preferable for applications that require identifying
maximal cliques in unweighted subgraphs.
Let us now review the comparison to GRASP, proposed in [1].
This is an approximate randomized algorithm for identifying large
dense subgraphs in unweighted graphs. While [1] has significantly
more general contributions, for the purposes of this discussion,
the algorithm proposed therein can be used to identify subgraphs
with density over a given threshold T , under AVGWEIGHT, in un-
weighted graphs. GRASP will not necessarily identify all dense
subgraphs, but can be executed multiple times per update, to
identify an increasingly larger number of such subgraphs. It is
important to note that, again, the comparison with DYNDENS is not
straightforward, as GRASP is geared towards identifying a few large
dense subgraphs, as opposed to all dense subgraphs.
12 [27] does not provide indexing details, so we opted for an efficient
solution, albeit with high memory consumption. We also experimented
with an adaptation of STIX that used our proposed index,
which has much lower memory requirements, but this invariably
resulted in increased runtime for STIX.
13 Since the goal is story identification, we set Nmax to a low value,
corresponding to story cardinalities suitable for humans.
Table 3: Top stories, May 1st 2011
Pres. Obama announces killing of Osama bin Laden, involving: Barack Obama, U.S. House Permanent Select Committee on Intelligence, Osama bin Laden, NBC News
Commentary on death of bin Laden, comparison to famous athletes, involving14: Barack Obama, LeBron James, Delonte West, Osama bin Laden
Discussions on Lady Gaga’s activities, involving: Lady Gaga, Galeria
Libya crisis: NATO Airstrike results in death of 3 grandchildren of Gaddafi, involving: NATO, Libya
Discussions on Harry Potter, involving: Hermione Granger, Draco Malfoy, Bella Swan
News on Osama Bin Laden’s Death Spreads On Twitter, involving15: Clint Eastwood, Barack Obama, U.S. House Permanent Select Committee on Intelligence, Osama bin Laden, CBS News
Nevertheless, we implemented GRASP, using an efficient hash-
based in-memory index 16. We set the parameter α that controls
its greediness vs. randomness tradeoff to 0.5, after ensuring this
did not result in any significant performance differences17. We ex-
ecuted GRASP on the unweighted dataset, for a varying number of
iterations per edge weight update (more iterations mean higher run-
time, and a higher likelihood of identifying more dense subgraphs),
and measured its runtime, and recall (fraction of output-dense sub-
graphs that it identified, excluding disconnected subgraphs, which
it does not produce). We limited GRASP to searching for subgraphs
of cardinalities up to Nmax = 5, and normalized the runtime of
GRASP to the runtime of DYNDENS for the same parameters18 (i.e.
the normalized runtime of DYNDENS is 1). The normalized run-
time of GRASP is reported in Figure 4(g), and its recall in Fig-
ure 4(f). As we can see, GRASP offers a runtime/recall tradeoff,
and can thus be at times more efficient than DYNDENS (however, in
such cases, it offers recall of under 80%). Moreover, GRASP offers
diminishing returns wrt. recall (i.e. it takes increasingly many iter-
ations to achieve arbitrarily high recall; even though the increase in
runtime is linear wrt. the number of iterations, the increase in recall
is decidedly sublinear). Thus, in this context, GRASP is best suited
to identifying a sample of all dense subgraphs. However, since high
recall is of crucial importance in story identification (missing 20%
of important stories would not generally be acceptable), DYNDENS
is best suited to addressing ENGAGEMENT in this setting.
Finally, we also investigated a simple baseline approach (BASE-
LINE), which periodically recomputes all output-dense subgraphs
wrt. AVGWEIGHT. The aim of this comparison was to validate
the necessity for incremental computation as opposed to periodic
offline recomputation. We implemented BASELINE using an effi-
cient hash-based in-memory index, and executed it on our experi-
mental datasets with varying parameters (T, Nmax), and at varying
uniform sampling intervals (i.e. every X tweets). We measured
the number of recomputations that BASELINE was able to perform,
given the same time as DYNDENS took for the entire dataset.
14 A Cleveland blogger compared Osama bin Laden to athlete LeBron
James; the discussion continued on Twitter, resulting in a
sports-related meme around the death of bin Laden.
15 C. Eastwood was mentioned in conjunction with this story as part
of a humorous meme started by comedian Steve Martin on Twitter.
16 The index used in [1] is optimized for secondary storage, hence
not very useful for the purposes of our comparison.
17 The average (over the values of all other parameters tested)
standard deviation of varying α ∈ (0, 1) was 4%, and the median
standard deviation was 1%.
18 For DYNDENS we selected a reasonable value of δit, given the
values of the rest of the parameters.
Even given the above restricted problem setting, we observed
that BASELINE was generally not up to the task of realtime story
identification. In our weighted dataset, and for a wide range of
parameters, it was able to perform up to 15-30 recomputations in
the same time that DYNDENS processed the entire dataset (corre-
sponding to identifying new stories every 48-96 minutes19). In the
unweighted dataset (which had on average fewer edges, and was
thus more amenable to reprocessing from scratch), BASELINE did
somewhat better, performing 135-300 recomputations for the pa-
rameters we experimented with (corresponding to identifying new
stories about every 5-10 minutes). More detailed results can be
found in [4]. We conclude that, although periodic recomputation
may be an option in limited scenarios (e.g. unweighted graphs,
AVGWEIGHT, not very strict realtime requirements), in general the
performance benefits of incremental recomputation are needed to
support realtime story identification.
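The refresh intervals quoted for BASELINE follow from dividing the one-day span of the dataset by the number of recomputations. The helper below is a small illustrative calculation and is not part of the evaluated systems.

```python
MINUTES_PER_DAY = 24 * 60


def refresh_interval_minutes(recomputations, span_minutes=MINUTES_PER_DAY):
    """Average time between recomputations spread evenly over the span."""
    return span_minutes / recomputations


weighted_range = (refresh_interval_minutes(30), refresh_interval_minutes(15))
unweighted_range = (refresh_interval_minutes(300), refresh_interval_minutes(135))
```

This reproduces the 48 to 96 minute range for the weighted dataset, and roughly 5 to 11 minutes for the unweighted one.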
5.3 Qualitative Results
Whereas the focus of this work is to efficiently identify dense
subgraphs in an incremental manner, we also provide evidence of
the effectiveness of our approach. Evaluating the quality of our re-
sults for realtime story identification is both inherently challenging,
due to the lack of a ground truth for what constitutes an important
story for a given medium (e.g. a micro-blogging site vs. a news
agency), as well as beyond the scope of this work. We will thus
present some sample results of utilizing dense subgraphs for story
identification. We have also built a live demo for our techniques,
which we will briefly discuss, and encourage interested readers to
visit so as to view this work in action.
In order to present sample results, we chose to focus on stories
at the granularity of a single day (since presenting stories that were
heavily discussed at a specific date and time would be hard to pro-
cess out of context). We used a dataset similar to the “unweighted”
one from our performance experiments, with the following two
modifications: entity correlations were computed over the entire
dataset, as opposed to using exponential decay; and edge weights
were retained for pairs of entities with log likelihood of over 5%
significance, rather than being thresholded and restricted to {0, 1}.
We computed dense subgraphs of cardinality up to Nmax = 5, us-
ing AVGDEGREE to quantify density, so as to favor larger dense
subgraphs; for presentation purposes these were subsequently re-
ranked in a diversity-aware manner [2] (subgraph overlap was pe-
nalized by multiplying subgraph density by 1 − 0.8 · ( fraction of
story entities covered by previous stories) ).
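The overlap penalty can be sketched as a greedy re-ranking loop. The penalty formula is the one stated above; the greedy selection order is an assumption made for illustration, since the exact procedure of [2] is not described in this section, and the function name is hypothetical.

```python
OVERLAP_PENALTY = 0.8


def rerank(stories):
    """Greedy diversity-aware ranking.

    `stories` is a list of (entities, density) pairs. At each step the
    story with the highest penalized density is selected, where the
    penalty is 1 - 0.8 * (fraction of its entities already covered).
    """
    remaining = [(frozenset(entities), density) for entities, density in stories]
    covered = set()
    ranked = []

    def penalized(story):
        entities, density = story
        overlap = len(entities & covered) / len(entities)
        return density * (1 - OVERLAP_PENALTY * overlap)

    while remaining:
        best = max(remaining, key=penalized)
        remaining.remove(best)
        ranked.append(best)
        covered |= best[0]
    return ranked


stories = [
    ({"Barack Obama", "Osama bin Laden"}, 3.0),
    ({"Barack Obama", "Osama bin Laden", "NBC News"}, 2.9),
    ({"NATO", "Libya"}, 2.0),
]
ranking = rerank(stories)
```

In this toy input (the densities are made up), the second story shares two of its three entities with the first, so its penalized density drops below that of the unrelated Libya story, which is ranked ahead of it.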
Table 3 presents the resulting top stories. We observe that dis-
cussions on bin Laden’s death feature prominently in the list; more-
over, given the typical conversation tone on Twitter, distinct discus-
sions involved comparing the presidential announcement to famous
athletes14, and even the rapid propagation of the news on Twitter.
Other stories cover the evolving crisis in Libya, as well as lighter,
ongoing issues, such as Harry Potter, and Lady Gaga’s antics.
For comparative purposes, we also performed the same proce-
dure on a dataset consisting of all blog posts made on major blog
hosting platforms during the same day; due to space constraints the
results can be found in [4].
19 As our dataset corresponds to tweets made in one day.
Finally, to validate the effectiveness of our approach, we have built
a live demo of our techniques, in the context of Grapevine [3]. This
prototype processes millions of blog posts on a daily basis, and
computes important stories in real-time. It consists of a pipeline
that processes blog posts as they are crawled: it rejects spam and
non-English posts, extracts named entity mentions, updates the
entity graph, and uses DYNDENS to update the set of current
dense subgraphs, as in the “unweighted” dataset used in our
experiments. It also keeps track of output-dense subgraphs,
which are reported to the user upon request. Besides the entities
involved in each output-dense subgraph/story, a few links to
relevant blog posts are provided, as well as a link back to Grapevine
for further exploration of the historical evolution of the story.
Interested readers are encouraged to explore this prototype, available
at www.onthegrapevine.ca/now.jsp .
6. HEURISTICS
In concluding our exposition of DYNDENS, let us also examine
two additional heuristics that can offer modest performance
improvements, without affecting the quality of results. Both are
related to limiting the number of explorations and cheap explorations
performed. Due to space constraints, the full details for these, and
proofs of their correctness, are omitted, and can be found in [4].
MAXEXPLORE: Whereas it serves to prove the correctness of
DYNDENS, the previous bound on exploration iterations that need
to be performed on a subgraph C is overly pessimistic, as it is based
on several worst-case assumptions. To overcome this challenge,
we developed MAXEXPLORE, an improvement over the previous
bound, that takes the graph neighborhood of the updated edge, as
well as the cardinality of the subgraph being explored, into account.
As it is a fairly cheap bound to compute, we can expect MAXEX-
PLORE to lead to performance improvements in the case of dense
subgraphs on which multiple exploration iterations would have oth-
erwise been performed.
DEGREEPRIORITIZE: Another challenge in the basic form of
DYNDENS discussed so far, is that a single subgraph might be
explored multiple times, by exploration procedures originating from
each of its dense subgraphs. To mitigate the adverse effects this can
have on performance, we developed DEGREEPRIORITIZE, a way to
organize the search space, and thus often avoid performing redun-
dant explorations, inspired by the degree-based criterion proposed
in [28]. At a high level, it guarantees that DYNDENS does not need
to explore (or cheap-explore) a subgraph with vertices having dense
connections to the subgraph. We thus expect DEGREEPRIORITIZE
to offer the greatest benefit to performance in cases of dense sub-
graphs on which redundant, multiple-iteration explorations would
have otherwise been performed.
Evaluation: In our evaluation of DYNDENS, the above heuris-
tics were enabled. Thus, to evaluate their performance benefits, we
also evaluated variants of DYNDENS where either DEGREEPRIOR-
ITIZE and/or MAXEXPLORE were disabled, on both our weighted
and unweighted datasets. We observed that these heuristics were
responsible for very modest performance improvements of up to
4%, and sometimes even resulted in worse performance.
By design, we expect the proposed heuristics to offer perfor-
mance benefits in cases where many explorations would have other-
wise been performed in their absence. To validate this, and further
investigate their potential to improve performance, we evaluated
them on a synthetic dataset that consisted of near-cliques, mixed
with random edges, that was generated as follows: In an initially
empty graph with 100K vertices, 250K updates were generated,
each of magnitude (0, 0.1] (with probability 0.3 the update was
negative). With probability 0.9, the update occurred within one
of 100 predefined sets of 10 vertices each; otherwise, it was uni-
formly randomly distributed to the remainder of the graph. Finally,
in order to evaluate the proposed heuristics in the absence of too-
dense subgraphs, updates that would result in too-dense subgraphs
for T = 0.7 and δit at 40% of its maximum value, were rejected.
Figure 4(h) shows the time taken by each DYNDENS variant (no
heuristics enabled, only DEGREEPRIORITIZE enabled, only MAXEXPLORE
enabled, both heuristics enabled), normalized by the time
taken by the first variant; the operating parameters were T = 0.7,
Nmax ∈ {8, 9, 10}, and δit at 40% of its maximum value (note
that the Y axis does not start at 0). The proposed heuristics are
seen to offer performance improvements of over 10% in the best case; thus,
while not as crucial as IMPLICITTOODENSE to performance, we
believe that the low effort required to implement these heuristics
makes them worthwhile for inclusion in DYNDENS.
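For readers who wish to reproduce a workload of this shape, the synthetic update stream described above can be sketched as follows. This is a minimal illustration, not the generator used in the evaluation. The function name generate_updates, its parameter names, and the seed are hypothetical. The sketch further assumes that the 100 predefined sets are disjoint and occupy the lowest vertex ids, that "the remainder of the graph" means both endpoints lie outside those sets, and that magnitudes are uniform on (0, 0.1]. The final rejection step for too-dense subgraphs is omitted, as it depends on the density definitions given earlier in the paper.

```python
import random


def generate_updates(num_vertices=100_000, num_updates=250_000,
                     num_groups=100, group_size=10,
                     p_in_group=0.9, p_negative=0.3,
                     max_magnitude=0.1, seed=0):
    """Return a list of (u, v, delta) edge weight updates.

    Assumptions (not stated in the text): the predefined vertex sets are
    disjoint and use the lowest vertex ids, and magnitudes are uniform.
    """
    rng = random.Random(seed)
    planted = num_groups * group_size
    # Predefined sets in which near-cliques are expected to form.
    groups = [list(range(g * group_size, (g + 1) * group_size))
              for g in range(num_groups)]

    updates = []
    for _ in range(num_updates):
        if rng.random() < p_in_group:
            # Both endpoints fall inside one randomly chosen predefined set.
            u, v = rng.sample(rng.choice(groups), 2)
        else:
            # Both endpoints fall in the remainder of the graph.
            u, v = rng.sample(range(planted, num_vertices), 2)
        # 1 - random() lies in (0, 1], so the magnitude lies in (0, max_magnitude].
        magnitude = max_magnitude * (1.0 - rng.random())
        sign = -1.0 if rng.random() < p_negative else 1.0
        updates.append((u, v, sign * magnitude))
    return updates
```

Calling generate_updates() with its defaults yields 250K updates over 100K vertices. A faithful reproduction would additionally filter this stream, dropping each update that would create a too-dense subgraph for T = 0.7 and δit at 40% of its maximum value.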
7. RELATED WORK
While we are not aware of any work that addresses the maintenance
of dense subgraphs in weighted graphs, under streaming
edge weight updates, for a broad definition of density, there exists
a rich literature of works dealing with related problems.
[27] addresses incremental maximal clique maintenance, from a
mostly theoretical perspective, and using a growth property. This
is very closely related to a special case of ENGAGEMENT (namely,
for unweighted graphs, AVGWEIGHT, and T = 1). An important
difference is that our instantiation of ENGAGEMENT deals with all
cliques, with cardinality constraints, as opposed to maximal cliques
of unconstrained cardinality. As discussed in Section 5.2, while the
former is better suited to real-time story identification, the latter
may be preferable in other scenarios.
[28] addresses near-clique identification, in an offline setting,
again from a mostly theoretical perspective, and using a growth
property; this corresponds to the offline version of ENGAGEMENT
for unweighted graphs, and AVGWEIGHT. The techniques pro-
posed therein cannot be efficiently dynamized in a straight-forward
fashion, as the information they rely upon cannot be efficiently
maintained across updates. Our DEGREEPRIORITIZE pruning con-
dition is inspired by the parent degree-based criterion proposed in
this work. [23] addresses the same problem, using a similar growth
property, and with a focus on a parallel implementation. As with
the other works, the techniques developed therein are not straight-
forward to efficiently dynamize.
Max (quasi-) clique: Related problems occur in the maximum
clique [25] and quasi-clique literature. To overcome the intractabil-
ity and inapproximability of this problem, heuristics (typically ran-
domized) have been used to discover large (quasi-) cliques. A cru-
cial difference is that ENGAGEMENT requires the enumeration of
all dense subgraphs (as from an application perspective, each sub-
graph corresponds to a story of interest). In contrast, works in the
maximum (quasi-) clique domain are geared towards identifying
one “good” subgraph per execution iteration. Moreover, most such
heuristic techniques are not straightforward to efficiently dynamize.
Perhaps most closely related is the state-of-the-art Greedy Ran-
domized Adaptive Search Procedure used in [1] to identify large
dense subgraphs (quasi-cliques). Although this work is more fo-
cused towards developing techniques for limited main-memory sce-
narios, their techniques can be dynamized in an efficient manner to
address ENGAGEMENT for unweighted graphs and AVGWEIGHT
(cf. Section 5.2).
Local density: Other works have dealt with edge-weight update
semantics, albeit with much simpler definitions of density. For in-
stance [30] and others maintain dense subgraphs over sliding win-
dows using neighbor-based patterns (i.e. whether a dense subgraph
should be augmented with an additional node is decided based on
local information only). As the problem being addressed therein
is very different from ENGAGEMENT, the proposed techniques are
inapplicable in the latter domain.
Max-flow: [12], [20] and others use (primarily) max-flow based
algorithms to identify dense subgraphs. While max-flow algorithms
can be dynamized [22], [18], these algorithms can only identify and
maintain clusters containing user-specified nodes. In a related vein,
[14] uses max-flow to find the top-1 dense subgraph (for AVGDE-
GREE); however their techniques cannot be efficiently applied to a
top-k or threshold variant, nor can they be efficiently dynamized.
Dynamic graphs: Other works (e.g. [10], [6]) have dealt with
dynamic graph algorithms under edge weight updates, but do not
deal with density problems, focusing instead on properties such as
planarity, connectivity, triangle counting, etc. A notable exception
is [17], which discusses approximation algorithms to general maxi-
mization problems in dynamic graphs. It is, however, theoretical in
nature, and its focus is on the approximation ratio of the resulting
algorithm, not on efficiency.
Clustering: Related problems are also dealt with in the incre-
mental clustering literature (e.g. [11], [15], [8]); however, these
deal with graph node insertion and deletion, and the proposed tech-