
BinRank: Scaling Dynamic Authority-Based Search Using Materialized Subgraphs

Heasoo Hwang, Andrey Balmin, Berthold Reinwald, and Erik Nijkamp

Abstract—Dynamic authority-based keyword search algorithms, such as ObjectRank and personalized PageRank, leverage semantic link information to provide high-quality, high-recall search in databases and on the Web. Conceptually, these algorithms require a query-time PageRank-style iterative computation over the full graph. This computation is too expensive for large graphs, and not feasible at query time. Alternatively, building an index of precomputed results for some or all keywords involves very expensive preprocessing. We introduce BinRank, a system that approximates ObjectRank results by utilizing a hybrid approach inspired by materialized views in traditional query processing. We materialize a number of relatively small subsets of the data graph in such a way that any keyword query can be answered by running ObjectRank on only one of the subgraphs. BinRank generates the subgraphs by partitioning all the terms in the corpus based on their co-occurrence, executing ObjectRank for each partition using the terms to generate a set of random walk starting points, and keeping only those objects that receive non-negligible scores. The intuition is that a subgraph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these terms. We demonstrate that BinRank can achieve subsecond query execution time on the English Wikipedia data set, while producing high-quality search results that closely approximate the results of ObjectRank on the original graph. The Wikipedia link graph contains about 10^8 edges, which is at least two orders of magnitude larger than what prior state-of-the-art dynamic authority-based search systems have been able to demonstrate. Our experimental evaluation investigates the trade-off between query execution time, quality of the results, and storage requirements of BinRank.

Index Terms—Online keyword search, ObjectRank, scalability, approximation algorithms.


1 INTRODUCTION

THE PageRank algorithm [1] utilizes the Web graph link structure to assign global importance to Web pages. It works by modeling the behavior of a “random Web surfer” who starts at a random Web page and follows outgoing links with uniform probability. The PageRank score is independent of any keyword query. Recently, dynamic versions of the PageRank algorithm have become popular. They are characterized by a query-specific choice of the random walk starting points. In particular, two algorithms have received a lot of attention: Personalized PageRank (PPR) for Web graph data sets [2], [3], [4], [5] and ObjectRank for graph-modeled databases [6], [7], [8], [9], [10].

PPR is a modification of PageRank that personalizes search on a preference set containing Web pages that a user likes. For a given preference set, PPR performs a very expensive fixpoint iterative computation over the entire Web graph to generate personalized search results. Therefore, the issue of scalability of PPR has attracted a lot of attention [3], [4], [5].

ObjectRank [6] extends (personalized) PageRank to perform keyword search in databases. ObjectRank uses a query term posting list as a set of random walk starting points and conducts the walk on the instance graph of the database. The resulting system is well suited for “high recall” search, which exploits different semantic connection paths between objects in highly heterogeneous data sets. ObjectRank has successfully been applied to databases that have social networking components, such as bibliographic data [6] and collaborative product design [9].

However, ObjectRank suffers from the same scalability issues as personalized PageRank, as it requires multiple iterations over all nodes and links of the entire database graph. The original ObjectRank system has two modes: online and offline. The online mode runs the ranking algorithm once the query is received, which takes too long on large graphs. For example, on a graph of articles of the English Wikipedia¹ with 3.2 million nodes and 109 million links, even a fully optimized in-memory implementation of ObjectRank takes 20-50 seconds to run, as shown in Fig. 3. In the offline mode, ObjectRank precomputes top-k results for a query workload in advance. This precomputation is very expensive and requires a lot of storage space for precomputed results. Moreover, this approach is not feasible for all terms outside the query workload that a user may search for, i.e., for all terms in the data set dictionary. For example, on the same Wikipedia data set, the full dictionary precomputation would take about a CPU-year.

1176 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 8, AUGUST 2010

. H. Hwang is with the Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, Mail Code 0404, La Jolla, CA 92093-0404. E-mail: [email protected].

. A. Balmin and B. Reinwald are with IBM Almaden Research Center, 650 Harry Rd., San Jose, CA 95120. E-mail: [email protected], [email protected].

. E. Nijkamp is with the Technische Universität Berlin, Straße des 17. Juni 135, D-10623 Berlin, Germany. E-mail: [email protected].

Manuscript received 15 May 2009; revised 16 Sept. 2009; accepted 26 Nov. 2009; published online 3 May 2010.
Recommended for acceptance by Y. Ioannidis, D. Lee, and R. Ng.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDESI-2009-05-0428.
Digital Object Identifier no. 10.1109/TKDE.2010.85.

1. http://en.wikipedia.org.

1041-4347/10/$26.00 © 2010 IEEE Published by the IEEE Computer Society


In this paper, we introduce the BinRank system, which employs a hybrid approach where query time can be traded off for preprocessing time and storage. BinRank closely approximates ObjectRank scores by running the same ObjectRank algorithm on a small subgraph, instead of the full data graph. The subgraphs are precomputed offline. The precomputation can be parallelized with linear scalability. For example, on the full Wikipedia data set, BinRank can answer any query in less than 1 second, by precomputing about a thousand subgraphs, which takes only about 12 hours on a single CPU.

BinRank query execution easily scales to large clusters by distributing the subgraphs between the nodes of the cluster. This way, more subgraphs can be kept in RAM, thus decreasing the average query execution time. Since the distribution of the query terms in a dictionary is usually very uneven, the throughput of the system is greatly improved by keeping duplicates of popular subgraphs on multiple nodes of the cluster. The query term is routed to the least busy node that has the corresponding subgraph.

There are two dimensions to the subgraph precomputation problem: 1) how many subgraphs to precompute and 2) how to construct each subgraph that is used for approximation. The intuition behind our approach is that a subgraph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects w.r.t. one of these terms. For 1), we group all terms into a small number (around 1,000 in the case of Wikipedia) of “bins” of terms based on their co-occurrence in the entire data set. For 2), we execute ObjectRank for each bin using the terms in the bin as random walk starting points and keep only those nodes that receive non-negligible scores.
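The per-bin construction step can be sketched as follows. This is a minimal illustration of the idea, not the actual BinRank implementation; the helper names (`posting_lists`, `run_object_rank`) and the dict-of-lists graph representation are hypothetical:

```python
def materialize_subgraph(graph, posting_lists, bin_terms,
                         run_object_rank, epsilon=1e-4):
    """Build a materialized subgraph (MSG) for one bin of terms.

    Sketch only. `graph` maps each node to its outgoing neighbors,
    `posting_lists` maps a term to the set of nodes containing it, and
    `run_object_rank(graph, base_set)` is assumed to return a dict of
    ObjectRank scores.
    """
    # 1. The bin's base set is the union of its terms' posting lists.
    base_set = set()
    for term in bin_terms:
        base_set |= posting_lists[term]
    # 2. Run ObjectRank with the union as the random-walk starting points.
    scores = run_object_rank(graph, base_set)
    # 3. Keep only nodes with non-negligible scores (above the
    #    convergence threshold), plus the edges among them.
    threshold = epsilon / len(base_set)
    keep = {v for v, s in scores.items() if s > threshold}
    return {v: [u for u in graph[v] if u in keep] for v in keep}
```

Any term whose posting list is contained in the bin's base set can then be answered by running ObjectRank on the returned subgraph instead of the full graph.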

Our experimental evaluation highlights the tuning of the system needed to balance the query performance with the size and number of the precomputed subgraphs. Intuitively, query performance is highly correlated with the size of the subgraph, which, in turn, is highly correlated with the number of documents in the bin. Thus, normally, it is sufficient to create bins with a certain size limit to achieve a specific target running time. However, there is some variability in the process, and some bins may still result in unusually large subgraphs and slow queries. To address this, we employ an adaptive iterative process that further splits the problematic subgraphs to guarantee that a vast majority of queries will be executed within the allotted time budget.

Other approximation techniques have been considered before to improve the scalability of dynamic authority-based search algorithms. Monte Carlo algorithms are introduced in [4] and [5] for approximation during precomputation. HubRank [8] uses the same approximation as [4], but performs precomputation only for “hub” nodes. Sampling-based techniques might also be applied online. However, although these techniques claim online query processing, they have only been demonstrated on graphs with fewer than 10^6 links. In contrast, we demonstrate superior scalability of our approach on a Wikipedia graph that is two orders of magnitude larger. We also show that our approximation using ObjectRank itself is more precise than the sampling-based techniques.

Our contributions are:

. The idea of approximating ObjectRank by using materialized subgraphs (MSGs), which can be precomputed offline to support online querying for a specific query workload, or for the entire dictionary.

. Use of ObjectRank itself to generate MSGs for “bins” of terms.

. A greedy algorithm that minimizes the number of bins by clustering terms with similar posting lists.

. Extensive experimental evaluation on the Wikipedia data set that supports our performance and search quality claims. The evaluation demonstrates the superiority of BinRank over other state-of-the-art approximation algorithms.

The rest of the paper is organized as follows: We start with a survey of related work in Section 2. We give an overview of the ObjectRank algorithm in Section 3. Materialized subgraphs are introduced in Section 4, and the bin construction algorithm is described in Section 5. In Section 6, we suggest the adaptive MSG recomputation method that improves the performance of BinRank. Section 7 describes the architecture of the BinRank system. Section 8 walks through the experimental evaluation. We conclude in Section 9.

2 RELATED WORK

The issue of scalability of PPR [3] has attracted a lot of attention. PPR performs a very expensive fixpoint iterative computation over the entire graph to generate personalized search results. To avoid the expensive iterative calculation at runtime, one can naively precompute and materialize all the possible personalized PageRank vectors (PPVs). Although this method guarantees fast user response time, such precomputation is impractical, as it requires a huge amount of time and storage, especially when done on large graphs. In this section, we examine hub-based and Monte Carlo style methods that address the scalability problem of PPR, and give an overview of HubRank [8], which integrates the two approaches to improve the scalability of ObjectRank. Even though these approaches enabled PPR to be executed on large graphs, they either limit the degree of personalization or deteriorate the quality of the top-k result lists significantly.

Hub-based approaches materialize only a selected subset of PPVs. Topic-sensitive PageRank [2] suggests materialization of 16 PPVs of selected topics and linearly combining them at query time. The personalized PageRank computation suggested in [3] enables a finer-grained personalization by efficiently materializing significantly more PPVs (e.g., 100 K) and combining them using the hub decomposition theorem and dynamic programming techniques. However, it is still not a fully personalized PageRank, because it can personalize only on a preference set subsumed within a hub set H.

Monte Carlo methods replace the expensive power iteration algorithm with a randomized approximation algorithm [4], [5]. In order to personalize PageRank on any arbitrary preference set while maintaining just a small amount of precomputed results, Fogaras et al. [4] introduce the fingerprint algorithm, which simulates the random walk model of PageRank and stores the ending nodes of sampled walks. Since each random walk is independent, fingerprint


generation can be easily parallelized, and the quality of search results improves as the number of fingerprints increases. However, as mentioned in [4], the precision of search results generated by the fingerprint algorithm is somewhat less than that of power-iteration-based algorithms, and sometimes the quality of its results may be inadequate, especially for nodes that have many close neighbors. In [5], a Monte Carlo algorithm that takes into account not only the last visited nodes, but also all visited nodes during the sampled walks, is proposed. It also shows that Monte Carlo algorithms with iterative start outperform those with random start.

HubRank [8] is a search system based on ObjectRank that improved the scalability of ObjectRank by combining the above two approaches. It first selects a fixed number of hub nodes by using a greedy hub selection algorithm that utilizes a query workload in order to minimize the query execution time. Given a set of hub nodes H, it materializes the fingerprints of the hub nodes in H. At query time, it generates an active subgraph by expanding the base set with its neighbors. It stops following a path when it encounters a hub node whose PPV was materialized, or when the distance from the base set exceeds a fixed maximum length. HubRank recursively approximates the PPVs of all active nodes, terminating with the computation of the PPV for the query node itself. During this computation, the PPV approximations are dynamically pruned in order to keep them sparse. As stated in [8], the dynamic pruning plays a key role in outperforming ObjectRank by a noticeable margin. However, by limiting the precision of hub vectors, HubRank may produce somewhat inaccurate search results, as stated in [8]. Also, since it materializes only the PPVs of H, just as [3] does, the efficiency of query processing and the quality of query results are very sensitive to the size of H and the hub selection scheme. Finally, Chakrabarti [8] did not show any large-scale experimental results to verify the scalability of HubRank.

In Section 8, we perform quality and scalability experiments on the full English Wikipedia data set exported in October 2007, to show that BinRank is an efficient ObjectRank approximation method that generates a high-quality top-k list for any keyword query in the corpus. For a comparative evaluation of the performance of BinRank, we implemented Monte Carlo algorithm 4 of [5], which was shown to outperform the other variations in [5]. We also implemented HubRank [8] to check its scalability on our Wikipedia data set.

Unlike [4], which proves convergence to the exact solution on arbitrary graphs, and [8] and [3], which offer exact methods at the expense of limiting the choice of personalization, our solution is entirely heuristic. However, extensive experimental evaluation confirms that on real-world graphs, BinRank can strike a good balance between query performance and closeness of approximation.

3 OBJECTRANK BACKGROUND

In this section, we describe the essentials of ObjectRank [6], [9], [10]. We first explain the data model and query processing, and then discuss the result quality and scalability issues that motivate this paper.

3.1 Data Model

ObjectRank performs top-k relevance search over a database modeled as a labeled directed graph. The data graph G(V, E) models objects in a database as nodes, and the semantic relationships between them as edges. A node v ∈ V contains a set of keywords and its object type. For example, a paper in a bibliographic database can be represented as a node containing its title and labeled with its type, “paper.” A directed edge e ∈ E from u to v is labeled with its relationship type λ(e). For example, when a paper u cites another paper v, ObjectRank includes in E an edge e = (u → v) that has the label “cites.” It can also create a “cited by”-type edge from v to u. In ObjectRank, the role of edges between objects is the same as that of hyperlinks between Web pages in PageRank. However, note that edges of different edge types may transfer different amounts of authority. By assigning different edge weights to different edge types, ObjectRank can capture important domain knowledge such as “a paper cited by important papers is important, but citing important papers should not boost the importance of a paper.” Let w(t) denote the weight of edge type t. ObjectRank assumes that the weights of edge types are provided by domain experts.

3.2 Query Processing

For a given query, ObjectRank returns the top-k objects relevant to the query. We first describe the intuition behind ObjectRank, introduce the ObjectRank equation, and then elaborate on important calibration factors.

ObjectRank query processing can be illustrated using the random surfer model. A random surfer starts from a random node v_i among the nodes that contain the given keyword. These random surfer starting points are called a base set. For a given keyword t, the keyword base set of t, BS(t), consists of the nodes in which t occurs. Note that any node in G can be part of the base set, which makes ObjectRank support the full degree of personalization. At each node, the surfer follows outgoing edges with probability d, or jumps back to a random node in the base set with probability (1 − d).² At a node v, when the surfer determines which edge to follow, each edge e originating from v is chosen with probability w(λ(e)) / OutDeg(λ(e), v), where OutDeg(t, v) denotes the number of outgoing edges of v whose edge type is t. The ObjectRank score of v_i is the probability r(v_i) that a random surfer is found at v_i at a certain moment.

Let r denote the vector of ObjectRank scores [r(v_1), ..., r(v_i), ..., r(v_n)]^T, and let A be an n × n matrix with A_ij the probability that a random surfer moves from v_j to v_i by traversing an edge. Also, let q be the normalized base set vector s / |BS(t)|, where |BS(t)| is the size of the base set BS(t) and s is a base set vector [s_v1, ..., s_vi, ..., s_vn]^T, where s_vi = 1 if v_i is in BS(t) and 0 otherwise. The ObjectRank equation is

r = dAr + (1 − d)q.  (1)

For a given query t, the ObjectRank algorithm uses the power iteration method to compute the fixpoint of r, the ObjectRank vector w.r.t. t, where the (k+1)th ObjectRank vector is calculated as follows:


2. Throughout this paper, we assume that d = 0.85.


r^(k+1) = dA r^(k) + (1 − d)q.  (2)

The algorithm terminates when r converges, which is determined by using a term convergence threshold ε_t = ε / |BS(t)|. The constant ε is one of the main ObjectRank calibration parameters, as it controls the speed of convergence and the precision of r.
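As a concrete illustration, the power iteration of equation (2) with the term convergence threshold ε_t = ε / |BS(t)| can be sketched as follows. This is a toy dense-matrix version, assuming the typed edge weights w(λ(e)) / OutDeg(λ(e), v) have already been folded into A; a real implementation would use sparse adjacency structures:

```python
import numpy as np

def object_rank(A, base_set, n, d=0.85, epsilon=1e-4):
    """Power iteration for equation (2): r(k+1) = d*A*r(k) + (1-d)*q.

    A        -- n x n transition matrix; A[i, j] is the probability of
                moving from node j to node i.
    base_set -- indices of the nodes containing the query term t, BS(t).
    Stops when every component changes by less than the term
    convergence threshold epsilon / |BS(t)|.
    """
    q = np.zeros(n)
    q[list(base_set)] = 1.0 / len(base_set)  # normalized base set vector
    threshold = epsilon / len(base_set)
    r = q.copy()
    while True:
        r_next = d * (A @ r) + (1 - d) * q
        if np.max(np.abs(r_next - r)) < threshold:
            return r_next
        r = r_next
```

Since the iteration contracts at rate d per step, roughly log(ε_t) / log(d) iterations suffice, independent of the graph size; the cost per iteration, however, is proportional to the number of edges, which is what motivates running it on a small subgraph.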

3.3 Quality and Scalability

ObjectRank returns top-k search results for a given query using both the content and the link structure in G. Since it utilizes the link structure that captures the semantic relationships between objects, an object that does not contain a given keyword but is highly relevant to the keyword can be included in the top-k list. This is in contrast to the static PageRank approach, which only returns objects containing the keyword, sorted according to their PageRank scores. This key difference is one of the main reasons for ObjectRank’s superior result quality, as demonstrated by the relevance feedback survey reported in [6].

However, the iterative computation of ObjectRank vectors described in Section 3.2 is too expensive to execute at runtime. For a given query, ObjectRank iterates over the entire graph G to calculate the ObjectRank vector r until |r_i^(k+1) − r_i^(k)| is less than the convergence threshold for every r_i^(k+1) in r^(k+1) and r_i^(k) in r^(k). This is a very strict stopping condition. This iterative computation may take a very long time if G has a large number of nodes and edges. Therefore, instead of evaluating a keyword query at query time, the original ObjectRank system [6] precomputes the ObjectRank vectors of the keywords in H, the set of keywords, during the preprocessing stage, and then stores a list of <ObjId, RankValue> pairs per keyword. However, the preprocessing stage of ObjectRank is expensive, as it requires |H| ObjectRank executions and O(|V| × |H|) bits of storage. In fact, according to the worst-case bounds for PPR index size proven in [4], the index size must be Ω(|V| × |H|) bits for any system that returns the exact ObjectRank vectors.
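To put the Ω(|V| × |H|) bound in perspective, a back-of-envelope calculation using the 3.2 million Wikipedia nodes from Section 1 is instructive; the dictionary size of one million terms used below is an assumed illustrative figure, not a number from the paper:

```python
# Back-of-envelope estimate of the exact-PPV index lower bound.
V = 3_200_000            # nodes in the Wikipedia graph (Section 1)
H = 1_000_000            # assumed dictionary size, for illustration only
bits = V * H             # Omega(|V| * |H|) bits, per the bound in [4]
gigabytes = bits / 8 / 10**9
print(f"at least {gigabytes:.0f} GB of index")  # -> at least 400 GB of index
```

Even at one bit per (object, keyword) pair, which is far less than an actual score, the index already reaches hundreds of gigabytes, illustrating why full-dictionary precomputation of exact vectors is impractical.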

4 RELEVANT SUBGRAPHS

Our goal is to improve the scalability of ObjectRank while maintaining the high quality of top-k result lists. We focus on the fact that ObjectRank does not need to calculate the exact full ObjectRank vector r to answer a top-k keyword query (k ≪ |V|). We identify three important properties of ObjectRank vectors that are directly relevant to the result quality and the performance of ObjectRank. First, for many of the keywords in the corpus, the number of objects with non-negligible ObjectRank values is much less than |V|. This means that just a small portion of G is relevant to a specific keyword. Here, we say that an ObjectRank value r(v) of a node v is non-negligible if r(v) is above the convergence threshold. The intuition for applying the threshold is that differences between scores that are within the threshold of each other are noise after ObjectRank execution. Thus, scores below the threshold are effectively indistinguishable from zero, and objects that have such scores are not at all

relevant to the query term. Second, we observed that the top-k results of any keyword term t generated on subgraphs of G composed of nodes with non-negligible ObjectRank values, w.r.t. the same t, are very close to those generated on G. Third, when an object has a non-negligible ObjectRank value for a given base set BS1, it is guaranteed that the object gains a non-negligible ObjectRank score for another base set BS2 if BS1 ⊆ BS2. Thus, a subgraph of G composed of nodes with non-negligible ObjectRank values, w.r.t. a union of the base sets of a set of terms, could potentially be used to answer any one of these terms.

Based on the above observations, we speed up the ObjectRank computation for a query term q by identifying a subgraph of the full data graph that contains all the nodes and edges that contribute to accurate ranking of the objects w.r.t. q. Ideally, every object that receives a nonzero score during the ObjectRank computation over the full graph should be present in the subgraph and should receive the same score. In reality, however, ObjectRank is a search system that is typically used to obtain only the top-k result list. Thus, the subgraph only needs to have enough information to produce the same top-k list. We shall call such a subgraph a Relevant Subgraph (RSG) of a query.

Definition 4.1. The top-k result list of the ObjectRank of keyword term t on data graph G(V, E), denoted by OR(t, G, k), is a list of k objects from V sorted in descending order of their ObjectRank scores w.r.t. a base set that is the set of all objects in V that contain keyword term t.

Definition 4.2. A Relevant Subgraph (RSG(t, G, k)) of a data graph G(V, E) w.r.t. a term t and a list size k is a graph Gs(Vs, Es) such that Vs ⊆ V, Es ⊆ E, and OR(t, G, k) = OR(t, Gs, k).

It is hard to find an exact RSG for a given term, and it is not feasible to precompute one for every term in a large workload. However, we introduce a method to closely approximate RSGs. Furthermore, we observed that a single subgraph can serve as an approximate RSG for a number of terms, and that it is quite feasible to construct a relatively small number of such subgraphs that collectively cover, i.e., serve as approximate RSGs for, all the terms that occur in the data set.

Definition 4.3. An Approximate Relevant Subgraph (ARSG(t, G, k, c)) of a data graph G(V, E) with respect to a term t, list size k, and confidence limit c ∈ [0, 1] is a graph Gs(Vs, Es) such that Vs ⊆ V, Es ⊆ E, and τ(OR(t, G, k), OR(t, Gs, k)) > c.

Kendall’s τ is a measure of similarity between two ranked lists [11]. This measure is commonly used to describe the quality of approximation between the top-k lists of an exact ranking (RE) and an approximate ranking (RA) that may contain ties (nodes with equal ranks) [4], [8]. A pair of nodes that is strictly ordered in both lists is called concordant if both rankings agree on the ordering, and discordant otherwise. A pair is an e-tie if RE does not order the nodes of the pair, and an a-tie if RA does not order them. Let C, D, E, and A denote the number of concordant, discordant, e-tie, and a-tie pairs, respectively. Then, Kendall’s τ similarity between two rankings, RE and RA, is defined as


\tau(R_E, R_A) = \frac{C - D}{\sqrt{(M - E)(M - A)}},

where M is the total number of possible pairs, M = n(n − 1)/2, and n = |R_E ∪ R_A|. We linearly scale τ to the [0, 1] interval as in [4], [8].
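To make the measure concrete, the scaled Kendall's τ above can be sketched in Python as follows. This is our own illustration, not the authors' code; rankings are given as node-to-rank maps, nodes absent from one list are treated as tied at the bottom, and we assume each ranking strictly orders at least one pair (so the denominator is nonzero).

```python
import math
from itertools import combinations

def kendall_tau(re, ra):
    """Scaled Kendall's tau between an exact ranking RE and an
    approximate ranking RA, each given as a dict node -> rank.
    Pairs tied in either list are excluded from C and D, as in the
    definition; assumes at least one strictly ordered pair per list."""
    nodes = sorted(set(re) | set(ra))
    n = len(nodes)
    M = n * (n - 1) // 2          # total number of node pairs
    INF = float("inf")            # absent nodes rank last (tied)
    C = D = E = A = 0
    for u, v in combinations(nodes, 2):
        e_tie = re.get(u, INF) == re.get(v, INF)
        a_tie = ra.get(u, INF) == ra.get(v, INF)
        E += e_tie
        A += a_tie
        if not e_tie and not a_tie:
            # concordant if both rankings order the pair the same way
            if (re.get(u, INF) < re.get(v, INF)) == (ra.get(u, INF) < ra.get(v, INF)):
                C += 1
            else:
                D += 1
    tau = (C - D) / math.sqrt((M - E) * (M - A))
    return (tau + 1) / 2          # linear scaling from [-1, 1] to [0, 1]
```

Identical rankings score 1.0 and fully reversed rankings score 0.0 on the scaled interval.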

Definition 4.4. An ARSG cover of a data graph G(V, E), w.r.t. a keyword term workload W, list size k, and confidence limit c ∈ [0, 1], is a set of graphs Φ such that for every term t ∈ W, there exists Gs ∈ Φ that is an ARSG(t, G, k, c), and inversely, every Gs ∈ Φ is an ARSG(t, G, k, c) for at least one term t ∈ W.

We construct an ARSG for term t by executing ObjectRank with some set of objects B as the base set and restricting the graph to include only nodes with non-negligible ObjectRank scores NOR(B), i.e., those above the convergence threshold of the ObjectRank algorithm. We call the induced subgraph G[NOR(B)] a materialized subgraph for set B, denoted by MSG(B).

The main challenge of this approach is identifying a base set B that will provide a good RSG approximation for term t. We focus on sets B that are supersets of the base set of t. This relationship gives us the following important result:

Theorem 4.5. If BS1 ⊆ BS2, then v ∈ MSG(BS1) ⇒ v ∈ MSG(BS2).

Proof. Let BS1 and BS2 be subsets of V that satisfy BS1 ⊆ BS2. Also, let r1, r2, and r2\1 be the ObjectRank vectors and q1, q2, and q2\1 be the normalized base set vectors corresponding to BS1, BS2, and (BS2 − BS1), respectively. Then, by applying the linearity theorem in [3] to the ObjectRank equation (1), we get the following equation:

\lambda_1 r_1 + \lambda_{2\backslash 1} r_{2\backslash 1} = dA(\lambda_1 r_1 + \lambda_{2\backslash 1} r_{2\backslash 1}) + (1 - d)(\lambda_1 q_1 + \lambda_{2\backslash 1} q_{2\backslash 1}),

where λ1 = |BS1|/|BS2| and λ2\1 = |BS2 − BS1|/|BS2|. Since BS1 ⊆ BS2, λ1 + λ2\1 = 1, which satisfies the linearity theorem. Note that since λ1 q1 + λ2\1 q2\1 = q2, it follows that λ1 r1 + λ2\1 r2\1 = r2.

Now, consider a node v ∈ G that is in MSG(BS1). Since we just showed r2 = λ1 r1 + λ2\1 r2\1, we also have r2(v) = λ1 r1(v) + λ2\1 r2\1(v). Thus, r2(v) ≥ λ1 r1(v), because λ2\1 ≥ 0 and r2\1(v) ≥ 0. Also, since v ∈ MSG(BS1), r1(v) > ε/|BS1| by definition of MSG. Since λ1 = |BS1|/|BS2|,

r_2(v) \ge \lambda_1 r_1(v) > \frac{|BS_1|}{|BS_2|} \cdot \frac{\epsilon}{|BS_1|} = \frac{\epsilon}{|BS_2|}.

Since r2(v) > ε/|BS2|, by definition of MSG, v ∈ MSG(BS2). □

According to this theorem, for a given term t, if the term base set BS(t) is a subset of B, all the important nodes relevant to t are always subsumed within MSG(B), i.e., all the non-negligible end points of random walks originating from starting nodes containing t are present in the subgraph generated using B.
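The subsumption property is easy to check numerically. The sketch below is a simplified stand-in for the paper's engine (the graph, matrix layout, and parameter values are illustrative, not from the paper): it runs the ObjectRank-style power iteration r = dAr + (1 − d)q and extracts the MSG node set with threshold ε/|base|.

```python
import numpy as np

def objectrank(A, base, d=0.85, iters=200):
    """Power iteration r = d*A*r + (1-d)*q, with q uniform over the
    base set. A is assumed to be a column-stochastic transition matrix."""
    n = A.shape[0]
    q = np.zeros(n)
    q[list(base)] = 1.0 / len(base)
    r = q.copy()
    for _ in range(iters):
        r = d * (A @ r) + (1 - d) * q
    return r

def msg_nodes(A, base, eps):
    """Nodes scoring above the convergence threshold eps/|base|,
    i.e., the node set of MSG(base)."""
    r = objectrank(A, base)
    return {v for v in range(A.shape[0]) if r[v] > eps / len(base)}
```

On any such matrix, with BS1 ⊆ BS2, every node of MSG(BS1) also appears in MSG(BS2), matching Theorem 4.5.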

However, note that even though two nodes v1 and v2 are guaranteed to be found both in G and in MSG(B), the ordering of their ObjectRank scores might not be preserved on MSG(B), as we do not include intermediate nodes whose ObjectRank scores are below the convergence threshold. Missing intermediate nodes could deteriorate the quality of ObjectRank scores computed on MSG(B). However, it is unlikely that many walks terminating on relevant nodes will pass through irrelevant nodes. Thus, even if MSG(B) is not an RSG(t, G, k), it is very likely to be an ARSG(t, G, k, c) with high confidence c. Our experimental evaluation supports this intuition.

In this paper, we construct MSGs by clustering all the terms of the dictionary, or of a query workload if one is available, into a set of term "bins." We create a base set B for every bin by taking the union of the posting lists of the terms in the bin and construct MSG(B) for every bin. We remember the mapping of terms to bins, and at query time, we can uniquely identify the corresponding bin for each term and execute the term on the MSG of this bin.

Theorem 4.5 supports our intuition that a bin's MSG is very likely to be an ARSG for each term in the bin with fairly high confidence. Thus, the set of all bin MSGs will be an ARSG cover with sufficiently high confidence. Our empirical results support this claim. For example, after a reasonable tuning of parameter settings (ε = 0.0005 and maximum B size of 4,000 documents), 90 percent of our random workload terms ran on their respective bin MSGs with τ(OR(t, G, 100), OR(t, MSG, 100)) > 0.9. Moreover, the other 10 percent of terms, which had τ100 < 0.9, were all very infrequent terms. The most frequent among them appeared in eight documents. τ100 tends to be relatively small for infrequent terms, because there simply may not be 100 objects with meaningful relationships to the base set objects.

5 BIN CONSTRUCTION

As outlined above, we construct a set of MSGs for terms of a dictionary or a workload by partitioning the terms into a set of term bins based on their co-occurrence. We generate an MSG for every bin based on the intuition that a subgraph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these terms.

There are two main goals in constructing term bins. The first is controlling the size of each bin to ensure that the resulting subgraph is small enough for ObjectRank to execute in a reasonable amount of time. The second is minimizing the number of bins to save preprocessing time. After all, we know that precomputing ObjectRank for all terms in our corpus is not feasible.

To achieve the first goal, we introduce a maxBinSize parameter that limits the size of the union of the posting lists of the terms in the bin, called the bin size. As discussed above, ObjectRank uses a convergence threshold that is inversely proportional to the size of the base set, i.e., the bin size in the case of subgraph construction. Thus, there is a strong correlation between the bin size and the size of the materialized subgraph. As shown in Section 8, the value of maxBinSize should be determined by the quality and performance requirements of the system.

The problem of minimizing the number of bins is NP-hard. In fact, if all posting lists are disjoint, this problem reduces to the classical NP-hard bin packing problem [12]. We apply a greedy algorithm that picks an unassigned term with the largest posting list to start a bin and loops to add the term with the largest overlap with documents already in the bin. We use a number of heuristics to minimize the required number of set intersections, which dominate the complexity of the algorithm. The tight upper bound on the number of set intersections that our algorithm needs to perform is the number of pairs of terms that co-occur in at least one document. To speed up the execution of set intersections for larger posting lists, we use KMV synopses [13] to estimate the size of set intersections.

1180 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 8, AUGUST 2010

The algorithm in Fig. 1 works on term posting lists from a text index. As the algorithm fills up a bin, it maintains a list of document IDs that are already in the bin, and a list of candidate terms that are known to overlap with the bin (i.e., their posting lists contain at least one document that was already placed into the bin). The main idea of this greedy algorithm is to pick the candidate term whose posting list overlaps the most with documents already in the bin, without the posting list union size exceeding the maximum bin size.

While it is more efficient to prepare bins for a particular workload that may come from a system query log, it is dangerous to assume that a query term that has not been seen before will not be seen in the future. We demonstrate that it is feasible to use the entire data set dictionary as the workload, in order to be able to answer any query.
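A minimal version of the greedy loop might look like the following. This is a simplification for illustration only: it scans all unassigned terms instead of maintaining the candidate list described above, and it computes exact intersections instead of using KMV synopses.

```python
def pack_terms_into_bins(posting, max_bin_size):
    """Greedy sketch of bin packing: posting maps term -> set of doc ids.
    Each bin is seeded with the largest remaining posting list, then
    repeatedly extended with the term overlapping the bin the most,
    as long as the union stays within max_bin_size."""
    unassigned = dict(posting)
    bins = []
    while unassigned:
        # Seed a new bin with the largest remaining posting list.
        seed = max(unassigned, key=lambda t: len(unassigned[t]))
        bin_terms, bin_docs = [seed], set(unassigned.pop(seed))
        while True:
            best, best_overlap = None, -1
            for t, docs in unassigned.items():
                if len(bin_docs | docs) > max_bin_size:
                    continue  # union would exceed the bin size limit
                overlap = len(bin_docs & docs)
                if overlap > best_overlap:
                    best, best_overlap = t, overlap
            if best is None:
                break
            bin_docs |= unassigned.pop(best)
            bin_terms.append(best)
        bins.append(bin_terms)
    return bins
```

Terms with heavily overlapping posting lists end up in the same bin, while disjoint terms fall back to one bin each, as in classical bin packing.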

Due to caching of candidate intersection results in lines 12-14 of the algorithm, the upper bound on the number of set intersections performed by this algorithm is the number of pairs of co-occurring terms in the data set. Indeed, in the worst case, for every term t that has just been placed into the bin, we need to intersect the bin with every term t′ that co-occurs with t, in order to check if t′ is subsumed by the bin completely, and can be placed into the bin "for free."

For example, consider N terms with posting lists of size X each that all co-occur in one document d0, with no other co-occurrences. If the maximum bin size is 2(X − 1), a bin will have to be created for every term. However, to get to that situation, our algorithm will have to check intersections for every pair of terms. Thus, the upper bound on the number of intersections is tight.

In fact, it is easy to see from the above example that no algorithm that packs the bins based on the maximum overlap can do so with fewer than N(N − 1)/2 set intersections in the worst case. Fortunately, real-world text databases have structures that are far from the worst case, as shown in Section 8.

Lastly, we show that the number of bins the algorithm uses to pack a set of posting lists is at most 2γ · OPT, where γ indicates the degree of overlap across posting lists and OPT is the minimal number of bins. Note that since BinRank constructs an MSG for each bin during preprocessing, 2γ · OPT is also an upper bound on the number of MSGs.

Theorem 5.1. Given a set S of posting lists S_i, suppose that there exists γ ≥ 1 such that

\sum_{S_i \in S} |S_i| \le \gamma \cdot \left| \bigcup_{S_i \in S} S_i \right|.

Then, the approximation ratio of PackTermsIntoBins is 2γ.

Proof. Let OPT and OPT′ denote the optimal number of bins and the number of bins PackTermsIntoBins uses, respectively.

Claim 1: OPT ≥ Σ_{S_i ∈ S} |S_i| / (maxBinSize · γ). Since no bin can hold a total capacity of more than maxBinSize,

OPT \ge \frac{\left| \bigcup_{S_i \in S} S_i \right|}{maxBinSize}.

Also, since γ satisfies |∪_{S_i ∈ S} S_i| ≥ Σ_{S_i ∈ S} |S_i| / γ,

OPT \ge \frac{\left| \bigcup_{S_i \in S} S_i \right|}{maxBinSize} \ge \frac{\sum_{S_i \in S} |S_i|}{maxBinSize \cdot \gamma}.

∴ Claim 1 holds.

Claim 2: Σ_{S_i ∈ S} |S_i| > (OPT′ − 1) · maxBinSize / 2. Since no more than one bin is less than half full,

\left| \bigcup_{S_i \in S} S_i \right| > (OPT' - 1) \cdot \frac{maxBinSize}{2},

and Σ_{S_i ∈ S} |S_i| ≥ |∪_{S_i ∈ S} S_i|. ∴ Claim 2 holds.

By Claim 1 and Claim 2,

OPT \ge \frac{\sum_{S_i \in S} |S_i|}{maxBinSize \cdot \gamma} > \frac{OPT' - 1}{2\gamma},

i.e., OPT > (OPT′ − 1)/(2γ). ∴ OPT′ ≤ 2γ · OPT. □

6 ADAPTIVE MSG RECOMPUTATION

We construct bins of up to a certain number of documents based on the intuition that a limited bin size will limit the resulting MSG size, which, in turn, will limit the running time of the query. As we demonstrate in Section 8, this intuition holds for the average case; however, for a small minority of MSGs and queries, the running time can be two to three times higher than the average. Fortunately, we can detect problematic


Fig. 1. Bin computation algorithm.


MSGs and replace them with more efficient ones during the preprocessing stage.

Recall that the ObjectRank running time scales linearly with two parameters: the number of iterations required and the size of the graph. The number of iterations is correlated with the size of the base set, so for a given MSG, queries with the largest base sets are going to be the slowest. And for queries with fixed-size base sets, the running time will largely depend on the number of links in the graph. In fact, we report in Section 8.5 a 94 percent correlation between the number of links in an MSG and the BinRank running time for queries with large base sets. This observation enables us to reliably identify problematic MSGs based only on their link counts.

However, the correlation between the bin sizes and the MSG link counts is less obvious. Fig. 15 shows that the link count for MSGs follows a normal distribution even with all the bin and MSG generation parameters fixed. Thus, setting the generation parameters in such a way that no MSG exceeds a certain link-count threshold is not practical. Instead, we set the parameters so that only a small minority of MSGs exceed the limit, and then deal with this minority separately.

One way to deal with dangerously large MSGs is to recompute them with a larger convergence threshold, thus making them smaller. However, this may diminish the subsequent query result quality, so instead we choose to keep the same ε, but regenerate the bins that produced these MSGs with a smaller maxBinSize.

To do this, we introduce a new threshold maxMSGSize and generate a set of rejected bins RB that resulted in MSGs with a number of links larger than maxMSGSize. We then generate a new set of workload terms W′, which consists of all the keywords of all bins in RB, and rerun the PackTermsIntoBins algorithm with W′ and the new maxBinSize set to half of the original one. The new set of bins replaces RB, and the new MSGs are produced and tested against maxMSGSize. If some MSGs still fail the test, the process can be repeated iteratively.
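The recomputation loop can be sketched as below. The callables pack, build_msg, and link_count stand in for the system's PackTermsIntoBins, MSG generator, and link counter; they are assumptions for illustration, not the paper's actual interfaces.

```python
def adaptive_msgs(terms, max_bin_size, max_msg_size, pack, build_msg, link_count):
    """Iteratively rebuild oversized MSGs with smaller bins.
    pack(terms, size) -> list of bins; build_msg(bin) -> MSG;
    link_count(msg) -> number of links. Assumes halving the bin
    size eventually brings every MSG under the link-count limit."""
    bins = pack(terms, max_bin_size)
    msgs = []
    while bins:
        rejected = []
        for b in bins:
            msg = build_msg(b)
            if link_count(msg) > max_msg_size:
                rejected.append(b)      # MSG too large: re-pack its terms
            else:
                msgs.append(msg)
        if not rejected:
            break
        # Re-pack all terms of the rejected bins with half the bin size.
        terms = [t for b in rejected for t in b]
        max_bin_size = max(1, max_bin_size // 2)
        bins = pack(terms, max_bin_size)
    return msgs
```

With stub implementations of the three callables, one can verify that oversized bins are split until every resulting MSG passes the size test.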

7 SYSTEM ARCHITECTURE

Fig. 2 shows the architecture of the BinRank system. During the preprocessing stage (left side of the figure), we generate MSGs as defined in Section 4. During the query processing stage (right side of the figure), we execute the ObjectRank algorithm on the subgraphs instead of the full graph and produce high-quality approximations of top-k lists at a small fraction of the cost. In order to save preprocessing cost and storage, each MSG is designed to answer multiple term queries. We observed in the Wikipedia data set that a single MSG can be used for 330-2,000 terms, on average.

7.1 Preprocessing

The preprocessing stage of BinRank starts with a set of workload terms W for which MSGs will be materialized. If an actual query workload is not available, W includes the entire set of terms found in the corpus. We exclude from W all terms with posting lists longer than a system parameter maxPostingList. The posting lists of these terms are deemed too large to be packed into bins. We execute ObjectRank for each such term individually and store the resulting top-k lists. Naturally, maxPostingList should be tuned so that there are relatively few of these frequent terms. In the case of Wikipedia, we used maxPostingList = 2,000, and only 381 terms out of about 700,000 had to be precomputed individually. This process took 4.6 hours on a single CPU.

For each term w ∈ W, BinRank reads a posting list T from the Lucene3 index and creates a KMV synopsis T′ that is used to estimate set intersections.

The bin construction algorithm, PackTermsIntoBins, partitions W into a set of bins composed of frequently co-occurring terms. The algorithm takes a single parameter maxBinSize, which limits the size of a bin posting list, i.e., the union of the posting lists of all terms in the bin. During bin construction, BinRank stores the bin identifier of each term in the Lucene index as an additional field. This allows us to map each term to the corresponding bin and MSG at query time.

Fig. 2. System architecture.

3. http://lucene.apache.org.

The ObjectRank module takes as input a set of bin posting lists B and the entire graph G(V, E), along with a set of ObjectRank parameters: the damping factor d and the threshold value ε. The threshold determines the convergence of the algorithm as well as the minimum ObjectRank score of MSG nodes.

Our ObjectRank implementation stores a graph as a row-compressed adjacency matrix. In this format, the entire Wikipedia graph consumes 880 MB of storage and can be loaded into main memory for MSG generation. If the entire data graph does not fit in main memory, we can apply parallel PageRank computation techniques, such as the hypergraph partitioning schemes described in [14].

The MSG generator takes the graph G and the ObjectRank result w.r.t. a term bin b, and then constructs a subgraph G_b(V′, E′) by including only nodes u with r_b(u) > ε_b, where ε_b is the convergence threshold of b, that is, ε/|BS(b)|. Given the set of MSG nodes V′, the corresponding set of edges E′ is copied from the in-memory copy of G. The edge construction takes 1.5-2 seconds for a typical MSG with about 5 million edges.

Once the MSG is constructed in memory, it is serialized to a binary file on disk in the same row-compressed adjacency matrix format to facilitate fast deserialization. We observed that deserializing a 40 MB MSG on a single SATA disk drive takes about 0.6 seconds. In general, deserialization speed can be greatly improved by increasing the transfer rate of the disk subsystem.
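The node filtering and edge induction step can be sketched as follows, using plain dict adjacency rather than the row-compressed matrices the system uses; this is an illustration of the idea, not the paper's implementation.

```python
def build_msg(adj, scores, base_size, eps):
    """Keep nodes scoring above the convergence threshold eps/|base|,
    then induce the edge set on the surviving nodes.
    adj: node -> list of out-neighbors; scores: node -> ObjectRank score."""
    threshold = eps / base_size
    nodes = {v for v, s in scores.items() if s > threshold}
    # Copy only edges whose both endpoints survive the threshold.
    edges = {v: [u for u in adj.get(v, []) if u in nodes] for v in nodes}
    return nodes, edges
```

Edges touching a dropped node are discarded, which is exactly why the ordering of scores on the MSG can deviate slightly from the full graph, as discussed in Section 4.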

7.2 Query Processing

For a given keyword query q, the query dispatcher retrieves from the Lucene index the posting list bs(q) (used as the base set for the ObjectRank execution) and the bin identifier b(q). Given a bin identifier, the MSG mapper determines whether the corresponding MSG is already in memory. If it is not, the MSG deserializer reads the MSG representation from disk. The BinRank query processing module uses all available memory as an LRU cache of MSGs.

For smaller data graphs, it is possible to dramatically reduce MSG storage requirements by storing only the set of MSG nodes V′ and generating the corresponding set of edges E′ only at query time. However, in our Wikipedia data set, that would introduce an additional delay of 1.5-2 seconds, which is not acceptable in a keyword search system.

The ObjectRank module gets the in-memory instance of the MSG, the base set, and a set of ObjectRank calibrating parameters: 1) the damping factor d; 2) the convergence threshold ε; and 3) the number of top-k list entries k. Once the ObjectRank scores are computed and sorted, the resulting document IDs are used to retrieve and present the top-k objects to the user.

Multikeyword queries are processed as follows: For a given conjunctive query composed of n terms {t1, ..., tn}, the ObjectRank module gets the MSGs {MSG(b(t1)), ..., MSG(b(tn))} and evaluates each term over the corresponding MSG. Then, it multiplies the ObjectRank scores obtained over the MSGs to generate the top-k list for the query. For a disjunctive query, the ObjectRank module sums the ObjectRank scores w.r.t. each term calculated using the MSGs to produce BinRank scores.
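The score combination step can be sketched as below; the per-term score maps are hypothetical inputs, which the real system would obtain by running ObjectRank on each term's MSG.

```python
def combine_scores(term_scores, mode="and"):
    """term_scores: one dict (node -> ObjectRank score) per query term.
    Conjunctive ("and") queries multiply per-term scores over nodes
    present for every term; disjunctive ("or") queries sum them.
    Returns (node, score) pairs sorted by descending combined score."""
    if mode == "and":
        nodes = set.intersection(*(set(s) for s in term_scores))
        combined = {v: 1.0 for v in nodes}
        for s in term_scores:
            for v in nodes:
                combined[v] *= s[v]
    else:
        combined = {}
        for s in term_scores:
            for v, sc in s.items():
                combined[v] = combined.get(v, 0.0) + sc
    return sorted(combined.items(), key=lambda kv: -kv[1])
```

Note that under the conjunctive semantics, a node missing from any one term's MSG drops out of the result entirely, consistent with multiplying scores where a missing score is zero.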

One advantage of the BinRank query execution engine is that it can easily utilize large clusters of nodes. In this case, we distribute MSGs between the nodes and employ Hadoop4 to start an MSG cache and an ObjectRank engine Web service on every node. A set of dispatcher processes, each with its own replica of the Lucene index, routes the queries to the appropriate nodes.

8 EXPERIMENTS

We present our experimental evaluation in this section. We first describe our experimental setup using English Wikipedia articles. Then, we show scalability numbers for ObjectRank, followed by numbers for BinRank. Finally, we present a performance comparison of BinRank with the Monte Carlo method and HubRank.

8.1 Setup

We evaluate the performance of the BinRank algorithm on the collection of English Wikipedia articles exported in October 2007. We parsed the 13.8 GB dump file and extracted 3.2M articles and 109M intrawiki links of 10 types (e.g., "Regular links," "Category links," "See also links," etc.). All the experiments in this section are performed over the labeled graph Gwiki = (Vwiki, Ewiki), which is composed of the Wikipedia articles as nodes and the intrawiki links as edges. We used the standard row-compressed matrix format to represent the link structure and weight dissipation rates of Ewiki compactly. We were able to store the 3.2M × 3.2M transition matrix of Gwiki with 109M nonzero elements in only 880 MB. We created a Lucene text index of the Wikipedia article titles, which takes up 154 MB. The dictionary of the index contains 698,214 terms.

We chose to index only article titles, by analogy with the original ObjectRank [6] setup that used only publication titles from DBLP. It is important for ObjectRank to have a base set of objects that are highly related to a search term. However, a large article can mention a term without being meaningfully related to it. For that reason, the title index works better than an index on the full text of the articles. In order to use a full article text index, the ObjectRank algorithm would have to be augmented to take into account the Lucene search scores of the base set documents. This is one of our future research directions.

For our experiments, we implemented the BinRank system (and other algorithms for performance comparisons) in Java and performed experiments on a single PC with a Pentium 4 3.40 GHz CPU and 2.0 GB of RAM.

8.2 ObjectRank on the Full Wikipedia Graph

ObjectRank on Gwiki takes too long to be executed online and consumes around 880 MB of memory just for the link information of Gwiki. As shown in Fig. 3, it takes around 20-50 seconds (30 seconds on average) to compute the dynamically generated top-k list for a given single keyword


4. http://hadoop.apache.org.


query even with our optimized, in-memory ObjectRank execution engine. For frequent keywords that have posting lists with more than 200 documents, ObjectRank is likely to take longer. Since frequent keywords are found in many articles, they are more likely to be meaningfully connected to many other articles through many paths, resulting in a wider search space for ObjectRank to evaluate and rank.

Fig. 3 also shows the keyword frequency distribution obtained from the Lucene text index built on the article titles. The total number of keywords in the index is 698,214, and the keyword frequencies follow the typical power-law distribution.

8.3 BinRank

During the BinRank preprocessing stage, we generate bins for all the keywords in the corpus. Once the bins are constructed, we generate an MSG per bin by executing ObjectRank on Gwiki using the union of the posting lists of the terms in a bin as a single base set. We first describe the performance of the bin construction and MSG generation, and then measure the query result quality and the impact of maxBinSize.

8.3.1 Preprocessing

Bin construction. To measure the performance of the bin construction stage, we examine the bin construction time and the number of bins constructed with different maxBinSize values.

We construct bins for all terms in our Lucene index, except for the 381 most frequent terms, which have posting lists longer than the system parameter maxPostingList = 2,000. Recall from Section 7 that such terms are deemed to be too frequent, so we precompute their ObjectRank authority vectors individually. This process takes 4.6 hours.

To pack the remaining 697,833 keywords into bins, we construct bins with various maxBinSize values, as shown in Fig. 4. Note that as maxBinSize increases, the bin construction algorithm generates fewer bins while consuming more time. The running time goes up because the greedy algorithm needs to try more intersections of larger sets to fill the larger bins. However, even with maxBinSize = 12,000, BinRank generates all 345 bins in only 1,106 seconds. This is a small fraction of the total preprocessing time, which is dominated by MSG construction, as we will see next.

Note that Wikipedia page titles are a very simple case for bin generation, as the typical document size is extremely small. We also tested the bin construction algorithm on the full text of Wikipedia pages. In this case, the total size of the posting lists in the text index was 84 million, versus 4.8 million for titles. The algorithm produced 6,340 bins with maxBinSize = 5,000, performing over 4 billion intersections. The packing process took about 70 hours.

MSG generation. Once the bins are constructed, we generate an MSG for each bin. For our Wikipedia data set, we generated a comprehensive set of MSGs with 24 combinations of the two parameters maxBinSize and ε. For each combination, we measure the performance of BinRank, i.e., the query time and the quality of top-k lists.

maxBinSize determines the number of bins to be constructed, and thus, the number of MSGs generated (the second column in Fig. 5). The construction time and average size go up with maxBinSize. Intuitively, the larger the base set, the more objects will be related to it. And the more objects have nontrivial scores, the more iterations it will take the ObjectRank algorithm to reach the fixpoint. Fig. 5 supports this intuition.

Note that the total MSG construction time decreases significantly as maxBinSize increases. However, the average MSG size increases at the same time, which leads to slower query execution time. Thus, there is a clear trade-off between preprocessing time and query time in BinRank.

Fig. 3. The number of keywords and average ObjectRank execution time on the Wikipedia graph per frequency range (ε is fixed to 5.0E-4).

Fig. 4. Performance of bin construction.

Fig. 5. The effect of maxBinSize on the MSG construction cost (ε is fixed to 5.0E-4).

Fig. 6. The effect of ε on the MSG construction cost (maxBinSize is fixed to 4,000).

Fig. 6 shows the effect of ε on MSG construction time and the size of MSGs. A smaller ε implies that ObjectRank will need more iterations to reach the convergence point, and that more nodes will have scores above the bin convergence threshold ε_b = ε/|BS(b)|. Thus, both construction time and MSG size decrease as ε increases.

An interesting observation from Figs. 5 and 6 is that the

storage requirement of BinRank, i.e., the total size of the MSGs, is controlled by the choice of ε and is virtually unaffected by maxBinSize. Of course, the quality of BinRank's score approximations is also strongly affected by ε, as we show next. Thus, one has to strike a balance between the quality of results and the storage overhead. For example, BinRank produces extremely high-quality results with ε = 5.0E-4. However, this setup requires 44 GB of storage for MSGs, which is 50 times the size of Gwiki. Another way to approach this trade-off is to say that the amount of disk, or even better, RAM available to the system will determine the quality of results.

As discussed in Section 7, it is possible to reduce MSG storage requirements by materializing MSG nodes only and extracting links at query time. The edge extraction adds 1.5-2 seconds to the query time, but the storage requirements in this case go down from 44 GB to only 203 MB, which is similar to the size of our Lucene index, 154 MB.

8.3.2 Query Processing

Quality measures. For a given keyword query, BinRank generates an approximate top-k list using the corresponding MSG. The exact top-k list is obtained by executing ObjectRank on Gwiki with a small ε = 1.0E-4. The two lists are compared using the same three quality measures as in [4]: relative aggregated goodness (RAG), precision at K, and Kendall's τ.

Let OR(kw, K) and BR(kw, K) denote the accurate top-k list by ObjectRank and the approximate top-k list by BinRank

list by ObjectRank and the approximate top-k list by BinRank

for a given keyword kw. In our experiments, both top-k lists

are lists of Wikipedia article IDs sorted by the authority

score. LetORScoreðn; kwÞ denote the exact keyword-specific

authority score of a node n computed by ObjectRank.

RAG and precision measure the quality of BR(kw, K) by considering the top-k lists as sets, say ORSet(kw, K) and BRSet(kw, K). RAG is the ratio of the aggregated exact authority scores of nodes in BR(kw, K) to the scores of nodes in OR(kw, K). Precision at K computes the ratio of the size of the intersection to K:

RAG(K) = \frac{\sum_{n \in BRSet(kw,K)} ORScore(n, kw)}{\sum_{n \in ORSet(kw,K)} ORScore(n, kw)},

Prec(K) = \frac{|BRSet(kw,K) \cap ORSet(kw,K)|}{K}.

Kendall's τ, as defined in Section 4, compares the orderings of the top-k lists, i.e., OR(kw, K) and BR(kw, K). It is the most stringent of the three quality measures that we use. A τ value of 1 means that the lists are identical, and 0 means that they are disjoint or in inverse order.
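The two set-based measures are straightforward to compute. A sketch, where or_scores holds the exact ObjectRank scores and the two lists are the exact and approximate top-k results (illustrative names, not the paper's code):

```python
def rag_and_precision(or_scores, or_topk, br_topk):
    """RAG: ratio of exact authority mass of BinRank's top-k to
    ObjectRank's top-k. Precision at K: fraction of ObjectRank's
    top-k that BinRank's top-k recovers."""
    rag = (sum(or_scores.get(n, 0.0) for n in br_topk)
           / sum(or_scores[n] for n in or_topk))
    prec = len(set(br_topk) & set(or_topk)) / len(or_topk)
    return rag, prec
```

RAG can stay close to 1 even when precision drops, since swapping in a node with nearly the same exact score barely changes the aggregated mass.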

Since we primarily aim to get high-quality top-k lists within a reasonable amount of query time, we want to find good combinations of maxBinSize and ε for BinRank. To tune these parameters, we compute quality measures for all 24 sets of MSGs described above: six different maxBinSize values and four different ε values. The smallest maxBinSize, 2,000, is chosen to be the same as the maximum posting list size for terms that are put into bins.

We run a workload of 92 randomly selected query terms on all of these 24 sets of MSGs.

Effect of maxBinSize on query time and quality of top-k lists using BinRank. With ε = 5.0E-4, we generated MSGs with six different maxBinSize values, starting from the smallest maxBinSize of 2,000. Fig. 7 shows that query time increases linearly as maxBinSize increases. This is because the average size of MSGs also increases linearly, as depicted in Fig. 5. For example, when maxBinSize is 2,000, an MSG is 21 MB, but it increases to 42 MB if maxBinSize increases to 4,000.

Next, we investigate the effects of the MSG size, which is determined by maxBinSize, on the accuracy of top-k lists. Fig. 8 shows the average accuracy of top-100 lists measured by the three goodness measures given ε = 5.0E-4. First, all the measures are in the [0.95, 1] range, indicating that the quality of the top-100 lists obtained by BinRank is very good. Second, as maxBinSize increases from 2,000 to 12,000, the accuracy remains the same or improves very slightly; we do not see a noticeable improvement in the quality of top-k lists. In contrast, the accuracy of top-k is sensitive to ε, as shown in Fig. 11.

Fig. 7. The effect of maxBinSize on the BinRank running time.

Fig. 8. The effect of maxBinSize on the top-100 accuracy (ε is fixed to 5.0E-4).

Fig. 9 illustrates the relationship between maxBinSize, ε, and the accuracy of top-k lists. It shows the distribution of τ5 through τ1,000 with 12 combinations of the parameters: all six maxBinSize values and two ε values, 5.0E-3 and 5.0E-4. One can see that the 12 lines form two clusters, one for ε = 5.0E-3 (bottom) and the other for ε = 5.0E-4 (top).

For a given ε and a set of maxBinSize values, if a larger maxBinSize does not improve the quality of top-k lists by a big margin, then we see no good reason to increase maxBinSize. It does decrease the preprocessing time in Fig. 5 by reducing the number of MSGs, but it increases the query processing time, as shown in Fig. 7. For example, with ε = 5.0E-4, we can see from Fig. 5 that the average size of MSGs is 127 MB when maxBinSize = 12,000, while it is 42 MB for maxBinSize = 4,000. However, Fig. 9 shows that the top-k lists generated on these two sets of MSGs are very similar on average. We computed standard deviations of the τ values of top-k lists with varying maxBinSize values and a fixed ε. They are very low: stdev(τ20) = 0.00627 and stdev(τ100) = 0.00672.

However, we cannot reduce maxBinSize without considering the total MSG construction time. One might want to construct bins with a very small maxBinSize. Setting aside the accuracy issue, BinRank would then construct too many bins to complete the MSG construction stage within a given time budget. The extreme case is to precompute and materialize MSGs or authority vectors for all the keywords in the dictionary, which is infeasible, especially when the size of the dictionary and the size of the full graph are huge, as in our Wikipedia data set.

Effect of ε on query time and quality of top-k lists of BinRank. As observed in Fig. 6, as ε decreases, the average size of MSGs increases. It takes more time to generate top-k lists on a larger MSG, on average, as shown in Fig. 10.

Now, we analyze the effect of ε on the quality of top-k lists. Unlike with maxBinSize, the quality of top-k lists improves as ε decreases, as can be seen in Fig. 11.

To measure how much an MSG covers the context of keywords in the corresponding bin, we computed the RankMass coverage metric [15] of sets of MSGs generated with five different ε values. In our experiments, we define

the RankMass of an MSG w.r.t. a keyword as the ratio of the aggregated authority scores of nodes in the MSG to the sum of all authority scores in G_wiki = (V_wiki, E_wiki). Let MSG(b, e, m) denote the set of nodes in the MSG generated for a bin b with ε = e and maxBinSize = m. Let us assume that the bin b contains a workload keyword kw. Then, the RankMass of an MSG w.r.t. kw is:

RankMass(MSG(b, e, m), kw) = ( Σ_{v ∈ MSG(b,e,m)} OR(v, kw) ) / ( Σ_{v ∈ V_wiki} OR(v, kw) ).

We computed the average RankMass coverage of an MSG using all the keywords in our workload, which shows how well an MSG covers the context of keywords in the corresponding bin. As we can expect, with a decreasing ε, the RankMass increases rapidly.

For example, if we compare the two sets of MSGs constructed with maxBinSize = 4,000 and maxBinSize = 12,000,

avg(|MSG(b, 5E-4, 12,000)|) = 3 × avg(|MSG(b, 5E-4, 4,000)|),

but the average RankMass only increases by 5.7 percent. The average size of MSGs for maxBinSize = 4,000 is 1.52 percent of |V_wiki|, while that for maxBinSize = 12,000 is 4.59 percent of

1186 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 8, AUGUST 2010

Fig. 9. The effect of maxBinSize on the top-k accuracy with fixed ε.

Fig. 10. The effect of ε on the BinRank running time.

Fig. 11. The effect of ε on the top-100 accuracy (maxBinSize is fixed to 4,000).



|V_wiki|. However, if we decrease ε from 1E-3 to 5E-4, the average size of MSGs increases from 0.98 percent of |V_wiki| to 1.52 percent of |V_wiki|, while the RankMass increases by 7.0 percent.
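The RankMass definition above is a straightforward ratio of sums. A minimal sketch, assuming a hypothetical dictionary object_rank mapping every node of the full graph to its ObjectRank score OR(v, kw), and msg_nodes as the node set of the MSG:

```python
def rank_mass(msg_nodes, object_rank):
    """RankMass of an MSG w.r.t. a keyword: the fraction of total
    ObjectRank authority over the full graph that is captured by
    the nodes of the MSG. `object_rank` maps node -> OR(v, kw)
    for all v in V_wiki."""
    total = sum(object_rank.values())
    covered = sum(score for v, score in object_rank.items() if v in msg_nodes)
    return covered / total if total > 0 else 0.0
```

A RankMass close to 1 means the MSG retains almost all of the authority mass relevant to the keyword, which is why coverage rises as ε shrinks and more low-scoring nodes are kept.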

8.4 BinRank for Multikeyword Queries

In this section, we investigate the performance of BinRank for multikeyword queries. Given a multikeyword query q composed of n keywords k1 ... kn, BinRank first evaluates each ki over the MSG corresponding to the keyword, MSG(ki). Then, it combines the rank scores computed over those MSGs according to the query semantics to produce the top-k list for q.

We observed from our experimental results that if a multikeyword query contains highly relevant keywords, such as “martial” AND “arts” or “fine” AND “performing,” BinRank assigns those keywords to the same bin, and thus, evaluates them using the same MSG. In this case, the top-k accuracy of the query is higher than that of randomly generated multikeyword queries. However, if the keywords composing a multikeyword query are assigned to different bins and the query is conjunctive, BinRank has to evaluate each keyword over a different MSG and combine the scores. We assign zero scores to nodes not in the MSG. Hence, if a conjunctive query contains keywords whose MSGs do not overlap, BinRank will return an empty result. However, we observed no such cases throughout our experiments, because certain highly popular subgraphs of G_wiki obtain non-negligible scores regardless of the keywords assigned to a bin.
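The combination step can be sketched as follows. This is only a sketch of one plausible semantics, not necessarily the paper's exact formula: we assume, as in the original ObjectRank work, that AND multiplies per-keyword scores and OR adds them, with an implicit score of zero for nodes absent from a keyword's MSG:

```python
def combine_scores(per_keyword_scores, conjunctive=True):
    """per_keyword_scores: list of dicts, one per keyword, mapping
    node -> score computed on that keyword's MSG. Nodes missing
    from a dict implicitly score 0, so a node absent from any MSG
    of a conjunctive query is dropped entirely."""
    all_nodes = set().union(*(s.keys() for s in per_keyword_scores))
    combined = {}
    for v in all_nodes:
        scores = [s.get(v, 0.0) for s in per_keyword_scores]
        if conjunctive:
            score = 1.0          # AND: product of per-keyword scores
            for x in scores:
                score *= x
        else:
            score = sum(scores)  # OR: sum of per-keyword scores
        if score > 0:
            combined[v] = score
    return combined
```

Under these semantics, a conjunctive query over non-overlapping MSGs yields an empty result, exactly the corner case discussed above.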

We randomly generated 600 multikeyword queries to measure the top-k accuracy of conjunctive queries and disjunctive queries containing 2 to 4 keywords. Throughout our experiments, we use maxBinSize = 4,000 and ε = 5.0E-4. We do not report the statistics of the BinRank running time for multikeyword queries, because it is dominated by the running time of BinRank for each individual query term.

We can see from Fig. 12 that the top-100 accuracy of disjunctive queries is higher than the top-100 accuracy for single-keyword queries, shown in Figs. 8 and 11. As shown in Fig. 9, the accuracy of a top-k list drops as K increases, because the scores of highly ranked nodes are more stable than those of the rest. Since the top-100 list for a disjunctive query tends to include the top-K nodes (K ≪ 100) of the top-100 lists obtained over the MSGs, its accuracy is at least as high as that for single-keyword queries, or slightly higher.

As shown in Fig. 13, RAG(100) and Prec(100) are above 0.9, indicating that the top-100 lists obtained by BinRank include most of the nodes in the top-100 lists generated by ObjectRank over G_wiki. However, the average τ_100 for conjunctive queries remains in the [0.75, 0.8] range, which is lower than those for single-keyword queries (Fig. 8) or disjunctive queries (Fig. 12). Therefore, we can see that for a given conjunctive query, BinRank generates a top-k list that contains most of the nodes in the top-k list obtained over G_wiki, but the ordering of nodes in the BinRank top-k list is not highly accurate. This is mainly because the MSGs are not large enough to cover all the important paths through which a significant amount of authority flows into or between the top-100 nodes, even though most of the links between the top-100 nodes exist on the corresponding MSGs. To improve the top-k accuracy of conjunctive queries, we can increase the coverage of MSGs by using a smaller ε. Note that increasing maxBinSize does not improve the top-k accuracy by a big margin, as shown in Fig. 8.

However, with a smaller ε, BinRank generates larger MSGs, increasing the query execution time. In particular, we observed that some MSGs require unacceptably long running times. Given a time budget, we want to identify such MSGs and recompute them, as described in Section 8.5.

8.5 Adaptive MSG Recomputation

In this section, we first examine the entire set of MSGs to understand their features. We obtain 1,043 bins, and then generate a set of MSGs M using the BinRank parameters maxBinSize = 4,000 and ε = 5.0E-4. The average number of nodes and links of an MSG is 48,616 and 5.2M, respectively, which is just 1.52 percent of |V_wiki| and 4.83 percent of |E_wiki|. Recall that G_wiki has 3.2M nodes and 109M links.

Next, to evaluate the quality of the MSGs in M, we pick a set of keywords Q by selecting the keyword with the largest frequency among the keywords assigned to each bin. The range of keyword frequencies in Q is [1, 2,000]. We select the most frequent keyword of each bin since it is very likely to result in the slowest BinRank execution time out of all the keywords in the bin, as discussed in Section 6.

The average BinRank execution time for queries in Q is 856 ms, which is much faster than the average ObjectRank execution time on G_wiki, 30 seconds. However, we observe that some queries in Q require almost 2 seconds to evaluate, which is sometimes not acceptable. The goal of the MSG recomputation algorithm is to predict and prevent such


Fig. 12. The top-100 accuracy for disjunctive queries (maxBinSize = 4,000 and ε = 5.0E-4).

Fig. 13. The top-100 accuracy for conjunctive queries (maxBinSize = 4,000 and ε = 5.0E-4).



cases so that the BinRank running time does not exceed a certain time budget, which is set to 1 second throughout these experiments.

As we discussed in Section 6, the BinRank query running time depends on features of the query (e.g., the base set size) and features of the corresponding MSG (e.g., the number of nodes and the number of links). Other factors, such as the connectivity of links on an MSG and the topology of the base set nodes, also affect the BinRank running time, but they are harder to quantify, and the simple features prove to be sufficiently good predictors.

The correlation coefficients, denoted by r, between the BinRank running time and each of the three simple features are the following:

- r1 = 0.564: with the number of nodes on an MSG.

- r2 = 0.700: with the number of links on an MSG.

- r3 = 0.459: with the base set size of a query.

r2 is noticeably higher than r1 or r3, which indicates that the number of links on an MSG is more tightly correlated with the BinRank running time than the other two features. Actually, since r2 is obtained from all the queries in Q, whose base set sizes vary significantly within [1, 2,000], we can see the effect of the number of links on an MSG more clearly after reducing the effect of the base set size. To do so, we select a set of 292 queries with high frequency, [1,000, 2,000], and denote it as Qhf. From Fig. 14, obtained using Qhf, we can clearly observe a very strong correlation between the number of links on an MSG and the BinRank running time. With a very high R² value, the BinRank running time of a query is almost linear in the number of links on the corresponding MSG. Also, the correlation coefficient between the number of links on an MSG and the BinRank running time using the high-frequency keywords in Qhf is 0.938, which also indicates a very strong correlation.
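The running-time predictor implied here reduces to a least-squares line over (link count, running time) pairs, with the Pearson coefficient quantifying how well the line explains the data. A generic sketch (not the authors' measurement code):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting the BinRank
    running time (ys) from the number of links on an MSG (xs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx
```

An r of 0.938, as measured on Qhf, means the fitted line explains roughly R² ≈ 0.88 of the running-time variance, which is what makes the link count usable as a predictor.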

By exploiting this strong correlation, we select the MSGs whose BinRank running time will be above a certain time budget with high probability. As can be seen in Fig. 15, the number of links on an MSG almost follows a normal distribution N(μ, σ²), where μ = 5.2E6 and σ = 1.0E6. Our experiments show that among the 1,043 MSGs in M, 144 MSGs have more than (μ + σ) links, and among them, 138 MSGs (94.4 percent) require more than 1 sec to produce top-k lists for the largest-frequency keyword in the

corresponding bin. In contrast, the probability that the worst-case BinRank running time exceeds 1 sec is just 16.4 percent for MSGs with fewer than (μ + σ) links. If we pick the MSGs with fewer than μ links, only 5.6 percent of them spend more than 1 sec to compute top-k lists in the worst case. Therefore, by default, BinRank sets the maxMSGSize parameter to (μ + σ) and recomputes bins for all the MSGs with higher link counts, using the halved maxBinSize, as described in Section 6. Recall from Fig. 7 that reducing the maxBinSize linearly reduces the query time, thus dramatically reducing the number of queries running over budget.

In the general case, maxMSGSize could be set to (μ + xσ), where good candidates for x lie within [0, 1], as we can see in our experimental results. In the future, we plan to investigate optimizing x, while considering such factors as the time and space budget for MSG generation. For example, if x = 0, we need to regenerate about 50 percent of the MSGs, while we regenerate only 14 percent of them when x = 1.
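The recomputation rule above is just a threshold test on link counts. A minimal sketch using the standard-library statistics module, where link_counts is a hypothetical list of per-MSG link counts:

```python
import statistics

def msgs_to_recompute(link_counts, x=1.0):
    """Return indices of MSGs whose link count exceeds mu + x*sigma,
    i.e., those whose predicted BinRank running time is likely to
    exceed the time budget. The default x = 1 matches the paper's
    default maxMSGSize = mu + sigma."""
    mu = statistics.mean(link_counts)
    sigma = statistics.pstdev(link_counts)  # population std. deviation
    threshold = mu + x * sigma
    return [i for i, links in enumerate(link_counts) if links > threshold]
```

For a roughly normal link-count distribution, x = 1 flags about the top 16 percent of MSGs (14 percent in the measured data), while x = 0 flags about half, matching the trade-off described above.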

Another approach we are planning to investigate is to base maxMSGSize on actual query performance measurements. The BinRank running time also follows a normal distribution N(μ_t, σ_t²), and the time budget, 1 second in our experiments, corresponds to μ_t + 0.58σ_t. Since the BinRank running time and the number of links on an MSG are highly correlated, as shown in Fig. 14, we can use 0.58 as x to select the MSGs to regenerate.

8.6 Performance Comparison of BinRank with Monte Carlo Method and HubRank

In this section, we present a performance comparison of BinRank with Monte Carlo style methods and HubRank. We implemented Monte Carlo algorithm 4, “MC complete path stopping at dangling nodes,” introduced in [5], and HubRank [8], which combines a hub-based approach with a Monte Carlo method called fingerprint.

For a given keyword query, the Monte Carlo algorithm simulates random walks starting from the nodes containing the keyword. Within a specified number of walks, it samples exactly the same number of random walks per starting point. The authority score of a node is the number of visits to the node divided by the total number of visits. Fig. 16 shows the performance of the Monte Carlo algorithm in terms of the accuracy of top-k lists at various query times. We


Fig. 14. The effect of the number of links on an MSG on the BinRank running time (maxBinSize = 4,000 and ε = 5.0E-4). The Pearson correlation coefficient is 0.938.

Fig. 15. The distribution of the number of links on an MSG (1,043 MSGs generated using maxBinSize = 4,000 and ε = 5.0E-4).



used our workload keyword queries and executed the Monte Carlo algorithm with different total numbers of sampled walks. As the number of sampled walks increases, the algorithm generates higher quality top-k lists, which usually takes more time.
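The “MC complete path stopping at dangling nodes” algorithm can be sketched as follows. This is a simplified version under stated assumptions: each walk terminates with probability 1 - d at every step or when it reaches a node with no outgoing links, and visit counts over all walks are normalized into scores:

```python
import random
from collections import defaultdict

def monte_carlo_scores(graph, base_set, walks_per_node=100, d=0.85):
    """graph: dict mapping node -> list of successor nodes.
    base_set: nodes containing the query keyword (walk starting points).
    Returns node -> estimated authority score (visit frequency)."""
    visits = defaultdict(int)
    for start in base_set:
        for _ in range(walks_per_node):  # same number of walks per start
            v = start
            while True:
                visits[v] += 1
                # Stop at dangling nodes, or terminate with probability 1 - d.
                if not graph.get(v) or random.random() > d:
                    break
                v = random.choice(graph[v])
    total = sum(visits.values())
    return {v: count / total for v, count in visits.items()}
```

The accuracy/time trade-off measured in Fig. 16 corresponds directly to walks_per_node: more sampled walks shrink the variance of the visit-frequency estimates but lengthen the query time.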

However, we can see that the τ values in Fig. 16 are not as high as those of BinRank in Fig. 8. With maxBinSize = 2,000 or 4,000 and ε = 5E-4, BinRank generates high-quality top-k lists with τ ≈ 0.95 in 350-750 ms, on average, as shown in Figs. 8 and 10. However, according to Fig. 16, the Monte Carlo algorithm generates top-k lists with τ ≈ 0.70 within the same amount of time. To get high-quality top-k lists, the Monte Carlo algorithm would take around 7 seconds per query term, which is probably not acceptable in an online search system.

We also implemented HubRank [8] in order to measure its scalability and top-k quality over G_wiki. We selected hubs, and then materialized a large number of fingerprints, while keeping the hub set fairly focused on our experimental query workload to save preprocessing cost. For a given keyword query, HubRank generates the active graph of the query by expanding the base set's neighborhood until it is bounded by hub nodes or nodes very far from the given query node. Since G_wiki contains 3.2M nodes and 109M links, we often needed to compute many (thousands of) active vectors to answer a single query, where each active vector is a (sparse) vector of 3.2 million numbers. Due to this requirement, for most queries, we could not keep all the necessary active vectors in memory. The authors of [8] also reported that their implementation ran out of memory on a few queries, while they were running their experiments on a graph with fewer than a million edges. A two-orders-of-magnitude increase in the size of the graph made this problem ubiquitous and prevented us from obtaining comparable results.

9 SUMMARY AND CONCLUSIONS

In this paper, we proposed BinRank as a practical solution for scalable dynamic authority-based ranking. It is based on partitioning and approximation using a number of materialized subgraphs. We showed that our tunable system offers a nice trade-off between query time and preprocessing cost.

We introduce a greedy algorithm that groups co-occurring terms into a number of bins for which we compute materialized subgraphs. Note that the number of bins is much smaller than the number of terms. The materialized subgraphs are computed offline by using ObjectRank itself. The intuition behind the approach is that a subgraph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these terms. Our extensive experimental evaluation confirms this intuition.

For future work, we want to study the impact of other keyword relevance measures besides term co-occurrence, such as thesauri or ontologies, on the performance of BinRank. By increasing the relevance of keywords in a bin, we expect that the quality of the materialized subgraphs, and thus the top-k quality and the query time, can be improved.

We also want to study better solutions for queries whose random surfer starting points are provided by Boolean conditions. And ultimately, although our system is tunable, its configuration, ranging from the number of bins and the size of bins to the tuning of the ObjectRank algorithm itself (edge weights and thresholds), is quite challenging, and a wizard to aid users is desirable.

To further improve the performance of BinRank, we plan to integrate BinRank and HubRank [8] by executing HubRank on the MSGs that BinRank generates. Currently, we use the ObjectRank algorithm on MSGs at query time. Even though HubRank is not as scalable as BinRank, it performs better than ObjectRank on smaller graphs such as MSGs. In this way, we can leverage the synergy between BinRank and HubRank.

REFERENCES

[1] S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Computer Networks, vol. 30, nos. 1-7, pp. 107-117, 1998.

[2] T.H. Haveliwala, “Topic-Sensitive PageRank,” Proc. Int’l World Wide Web Conf. (WWW), 2002.

[3] G. Jeh and J. Widom, “Scaling Personalized Web Search,” Proc. Int’l World Wide Web Conf. (WWW), 2003.

[4] D. Fogaras, B. Racz, K. Csalogany, and T. Sarlos, “Towards Scaling Fully Personalized PageRank: Algorithms, Lower Bounds, and Experiments,” Internet Math., vol. 2, no. 3, pp. 333-358, 2005.

[5] K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova, “Monte Carlo Methods in PageRank Computation: When One Iteration Is Sufficient,” SIAM J. Numerical Analysis, vol. 45, no. 2, pp. 890-904, 2007.

[6] A. Balmin, V. Hristidis, and Y. Papakonstantinou, “ObjectRank: Authority-Based Keyword Search in Databases,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2004.

[7] Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma, “Object-Level Ranking: Bringing Order to Web Objects,” Proc. Int’l World Wide Web Conf. (WWW), pp. 567-574, 2005.

[8] S. Chakrabarti, “Dynamic Personalized PageRank in Entity-Relation Graphs,” Proc. Int’l World Wide Web Conf. (WWW), 2007.

[9] H. Hwang, A. Balmin, H. Pirahesh, and B. Reinwald, “Information Discovery in Loosely Integrated Data,” Proc. ACM SIGMOD, 2007.

[10] V. Hristidis, H. Hwang, and Y. Papakonstantinou, “Authority-Based Keyword Search in Databases,” ACM Trans. Database Systems, vol. 33, no. 1, pp. 1-40, 2008.

[11] M. Kendall, Rank Correlation Methods. Hafner Publishing Co., 1955.

[12] M.R. Garey and D.S. Johnson, “A 71/60 Theorem for Bin Packing,” J. Complexity, vol. 1, pp. 65-106, 1985.

[13] K.S. Beyer, P.J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla, “On Synopses for Distinct-Value Estimation under Multiset Operations,” Proc. ACM SIGMOD, pp. 199-210, 2007.


Fig. 16. Top-k accuracy of the Monte Carlo algorithm at various query times.



[14] J.T. Bradley, D.V. de Jager, W.J. Knottenbelt, and A. Trifunovic, “Hypergraph Partitioning for Faster Parallel PageRank Computation,” Proc. Second European Performance Evaluation Workshop (EPEW), pp. 155-171, 2005.

[15] J. Cho and U. Schonfeld, “RankMass Crawler: A Crawler with High PageRank Coverage Guarantee,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2007.

Heasoo Hwang is currently working toward the PhD degree in computer science at the University of California, San Diego. Her primary research interests lie in the effective and efficient search and management of large-scale graph-structured data sets. She spent two summers with the IBM Almaden Research Center, where she worked on improving the efficiency of dynamic link-based search over graph-structured data, which motivated this BinRank paper.

The BinRank system is a part of her thesis work.

Andrey Balmin received the PhD degree from the University of California at San Diego, where he devised the ObjectRank algorithm as part of his thesis work. He is a research staff member at IBM’s Almaden Research Center, where his research interests include search, querying, and management of semistructured and graph data.

Berthold Reinwald received the PhD degree from the University of Erlangen-Nuernberg, Germany, in 1993. Since 1993, he has been with the IBM Almaden Research Center, where he is currently a research staff member. His current research interests include scalable analytics and cloud data management. He is a member of the ACM.

Erik Nijkamp is working toward the graduate degree in computer science at the Technical University of Berlin. He is now a research assistant in the Department of Database Systems and Information Management. His research interests focus on aspects of distributed computing and machine learning.


