Conditional Reliability in Uncertain Graphs

arXiv:1608.04474v3 [cs.SI] 29 Apr 2018


Arijit Khan, Francesco Bonchi, Francesco Gullo, and Andreas Nufer

Abstract—Network reliability is a well-studied problem that asks for the probability that a target node is reachable from a source node in a probabilistic (or uncertain) graph, i.e., a graph where every edge is assigned a probability of existence. Many approaches and problem variants have been considered in the literature, most of them assuming that edge-existence probabilities are fixed. Nevertheless, in real-world graphs, edge probabilities typically depend on external conditions. In metabolic networks, a protein can be converted into another protein with some probability depending on the presence of certain enzymes. In social influence networks, the probability that a tweet of some user will be re-tweeted by her followers depends on whether the tweet contains specific hashtags. In transportation networks, the probability that a network segment will work properly might depend on external conditions such as weather or time of day.

In this paper, we overcome this limitation and focus on conditional reliability, that is, assessing reliability when edge-existence probabilities depend on a set of conditions. In particular, we study the problem of determining the top-k conditions that maximize the reliability between two nodes. We characterize our problem in depth and show that, even when employing polynomial-time reliability-estimation methods, it is NP-hard, does not admit any PTAS, and has a non-submodular objective function. We then devise a practical method that targets both accuracy and efficiency. We also study natural generalizations of the problem with multiple source and target nodes. An extensive empirical evaluation on several large, real-life graphs demonstrates the effectiveness and scalability of our methods.

Index Terms—Uncertain graphs, Reliability, Conditional probability.


1 INTRODUCTION

Uncertain graphs, i.e., graphs whose edges are assigned a probability of existence, have recently attracted a great deal of attention, due to their rich expressiveness and given that uncertainty is inherent in the data in a wide range of applications. Uncertainty may arise due to noisy measurements [2], inference and prediction models [1], or explicit manipulation, e.g., for privacy purposes [7]. A fundamental problem in uncertain graphs is the so-called reliability, which asks to measure the probability that two given (sets of) nodes are reachable [3]. Reliability has been well-studied in the context of device networks, i.e., networks whose nodes are electronic devices and the (physical) links between such devices have a probability of failure [3]. More recently, the attention has shifted to other types of networks that can naturally be represented as uncertain graphs, such as social networks or biological networks [23], [32].

In the bulk of the literature, reliability queries have been modeled without taking into account any external factor that could influence the probability of existence of the links in the network. In this paper, we overcome this limitation and introduce the notion of conditional reliability, which takes into account that edge probabilities may depend on a set of conditions, rather than being fixed. This situation models real-world uncertain graphs. As an example, Figure 1 shows a link (u, v) of a social influence network, i.e., a social graph where the associated probability represents the likelihood that a piece of information (e.g., a tweet) originated by u

• A. Khan is with Nanyang Technological University, Singapore.
• F. Bonchi is with ISI Foundation, Italy.
• F. Gullo is with UniCredit, R&D Dept., Italy.
• A. Nufer is with ETH Zurich, Switzerland.

The research is supported by MOE Tier-1 RG83/16 and NTU M4081678. Manuscript received January 31, 2017.

[Figure 1: edge (u, v) labeled with P((u, v)|c), shown next to the following table.]

c                 P((u, v)|c)
#NFL              0.9
#GetToThePolls    0.2
…                 …

Fig. 1: A link (u, v) of a social influence network, where the associated probability represents the likelihood that a tweet by u will be re-tweeted by her follower v. This probability depends on the content of the tweet. In this example, if the tweet contains the hashtag #NFL, then it will likely be re-tweeted, while if it is about elections (i.e., it contains the hashtag #GetToThePolls), it will be re-tweeted only with a small probability.

will be “adopted” (re-tweeted) by her follower v. The re-tweeting probability clearly depends on the content of the tweet. In the example, v is much more interested in sports than politics. Hence, if the tweet contains the hashtag #NFL, then it will likely be re-tweeted by v, while if it is about elections (i.e., it contains the hashtag #GetToThePolls), it will be re-tweeted only with a small probability. In this example, hashtags correspond to external factors that influence probabilities. We hereinafter refer to such external factors as conditions or catalysts, and use all these terms interchangeably throughout the paper.

Given an uncertain graph with external-factor-dependent edge probabilities, in this work we study the following problem: Given a source node s, a target node t, and a small integer k, identify a set of k catalysts that maximizes the reliability between s and t. This problem arises in many real-world scenarios, such as the ones described next.

Pathway formation in biological networks. To understand metabolic chain reactions in cellular systems, biologists utilize metabolic networks [22], where nodes represent compounds, and an edge between two compounds indicates that a compound can be transformed into another one through a chemical reaction. Reactions are controlled by various enzymes, and each enzyme defines a probability that the underlying reaction will actually take place. Thus, reactions (edges) are assigned various probabilities of existence, which depend on the specific enzyme (external factor). A fundamental question posed by biologists is to identify a set of enzymes which guarantee with high probability that a sequence of chemical reactions will take place to convert an input compound s into a target compound t. Since enzymes are expensive (they need to go through a long multi-step process before being commercialized [9]), the output enzyme set should be limited in size. Often known as cost-effective experiment design [29], [31], this corresponds to solving an instance of our problem: Given a source compound s and a target compound t, what is the set of top-k enzymes which maximizes the probability that s will be converted into t via a series of chemical reactions?

Information cascades. Studying information cascades in influential networks is receiving more and more attention, mainly due to its large applicability in viral marketing strategies. Social influence can be modeled as in Figure 1, i.e., by means of a probability that once u has been “activated” by a campaign, she will influence her friend v to perform the same action. This probability typically depends on topics and contents of the campaign [6], [11]. Within this view, let us consider the following example, which is motivated by [27]. During the 2016 US Presidential election, Hillary Clinton’s campaign promises were infrastructure rebuild, free trade, open borders, unlimited immigration, equal pay, background checks to gun sales, increasing minimum wage, etc. To get more votes, Hillary’s publicity manager could have prioritized the most influential among all these standpoints in subsequent speeches from her, her vice presidential candidate (Tim Kaine), and her political supporters (e.g., Barack and Michelle Obama), while also planning how to influence more voters from the “blue wall” states (Michigan, Pennsylvania, and Wisconsin) [33]. As speeches should be kept limited due to time constraints and risk of becoming ineffective in case of information overload, it is desirable to find a limited set of standpoints that maximize the influence from a set of early adopters (e.g., popular people who are close to Hillary Clinton) to a set of target voters (e.g., citizens of the “blue wall” states) [4]. This corresponds to identifying the top-k conditions that maximize the reliability between two (sets of) nodes in the social graph, i.e., the problem we study in this work.

Challenges and contributions. The problem that we study in this work is a non-trivial one. Computing standard reliability over uncertain graphs is a #P-complete problem [5]. We show that, even assuming polynomial-time sampling methods to estimate conditional reliability (such as RHT-sampling [23] or recursive stratified sampling [26]), our problem of computing a set of k catalysts that maximizes conditional reliability between two nodes remains NP-hard. Moreover, our problem turns out to be not easy to approximate, as (i) it does not admit any PTAS, and (ii) the underlying objective function is shown to be non-submodular. Therefore, standard algorithms, such as iterative hill-climbing that greedily maximizes the marginal gain at every iteration, do not provide any approximation guarantees and are expected to have limited performance. Within this view, we devise a novel algorithm that first extracts highly-reliable paths between source and target nodes, and then iteratively selects these paths so as to achieve maximum improvement in reliability while still satisfying the constraint on the number of conditions.

After studying the single-source-single-target query, we focus on generalizations where multiple source and target nodes can be provided as input, thus opening the stage to a wider family of queries and applications. We study two variants of this more general problem: (i) maximizing an aggregate function over pairwise reliability between nodes in source and target sets, and (ii) maximizing the probability that source and target nodes remain all connected.

The main contributions of this paper are as follows:

• We focus on the notion of conditional reliability in uncertain graphs, which arises when the input graph has conditional edge-existence probabilities. In particular, we formulate and study the problem of finding a limited set of conditions that maximizes reliability between a source and a target node (Section 2).

• We characterize our problem in depth from a theoretical point of view, showing that it is NP-hard and hard to approximate even when polynomial-time reliability estimation is employed (Section 2).

• We design an algorithm that provides effective (approximate) solutions to our problem, while also looking at efficiency. The proposed method selects a number of highly-reliable paths so as to maximize reliability while satisfying the budget on the number of conditions (Section 4).

• We generalize our problem and algorithms to the case of multiple source and target nodes (Section 5).

• We empirically demonstrate effectiveness and efficiency of our methods on real-life graphs, while also detailing applications in information cascade (Section 6).

2 SINGLE-SOURCE SINGLE-TARGET: PROBLEM STATEMENT

An uncertain graph 𝒢 is a quadruple (V, E, C, P), where V is a set of n nodes, E ⊆ V × V is a set of m directed edges, and C is a set of external conditions that influence the edge-existence probabilities. We hereinafter refer to such external conditions as catalysts. P : E × C → (0, 1] is a function that assigns a conditional probability to each edge e ∈ E given a specific catalyst c ∈ C, i.e., P(e|c) denotes the probability that the edge e exists given the catalyst c.

The bulk of the literature on uncertain graphs assumes that edge probabilities are independent of one another [23]. In this work, we make the same assumption. Additionally, we assume that the existence of an edge is determined by an independent process (coin flipping), one per catalyst c, and the ultimate existence of an edge is decided based on the success of at least one of such processes. This assumption naturally holds in various settings. For instance, in a metabolic network, with an initial compound and an enzyme, the probability that a target compound would be produced depends only on that specific reaction, and it is independent of other chemical reactions defined in the network. As a result, the global existence probability of an edge e, given a set of catalysts C1 ⊆ C, can be derived as

$$P(e \mid C_1) = 1 - \prod_{c \in C_1} \bigl(1 - P(e \mid c)\bigr).$$
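As an illustration of this combination rule, the following minimal Python sketch computes P(e|C1) for a single edge. The function name `edge_probability`, the dictionary layout, and the numeric values are assumptions made for the example and do not come from the paper.

```python
from math import prod

def edge_probability(p_given_catalyst, selected):
    """P(e | C1) = 1 - prod_{c in C1} (1 - P(e | c)).

    p_given_catalyst: dict mapping catalyst -> P(e | c) for one edge;
                      catalysts missing from the dict are treated as 0.
    selected: iterable of catalysts, i.e., the set C1.
    """
    return 1.0 - prod(1.0 - p_given_catalyst.get(c, 0.0) for c in selected)

# Hypothetical edge that can be activated by two catalysts.
p_e = {"c1": 0.5, "c2": 0.6}
single = edge_probability(p_e, {"c1"})        # only c1 is selected
both = edge_probability(p_e, {"c1", "c2"})    # 1 - 0.5 * 0.4
unrelated = edge_probability(p_e, {"c9"})     # catalyst that does not affect e
```

Selecting more catalysts can only raise the existence probability of an edge, which is why the objective is monotone in the catalyst set.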

Given a set C1 of catalysts, the uncertain graph 𝒢 yields 2^m deterministic graphs G ⊑ 𝒢|C1, where each G is a pair (V, E_G), with E_G ⊆ E, and its probability of being observed is given below.

$$P(G \mid C_1) = \prod_{e \in E_G} P(e \mid C_1) \prod_{e \in E \setminus E_G} \bigl(1 - P(e \mid C_1)\bigr) \tag{1}$$

[Figure 2: small example graph from s to t over edges e1, ..., e4.]

Fig. 2: Example of non-submodularity. P(e1|c1) = 0.5, P(e2|c2) = 0.6, P(e3|c3) = 0.5, P(e4|c1) = 0.5. P(e|c) = 0 for all other edge-catalyst combinations that are not specified.

For a source node s ∈ V and a target node t ∈ V, we define conditional reliability R((s, t)|C1) as the probability that t is reachable from s in 𝒢, given a set C1 of catalysts. Formally, for a possible graph G ⊑ 𝒢|C1, let I_G(s, t) be an indicator function taking value 1 if there exists a path from s to t in G, and 0 otherwise. R((s, t)|C1) is computed as follows.

$$R((s, t) \mid C_1) = \sum_{G \sqsubseteq \mathcal{G}|C_1} \bigl[ I_G(s, t) \times P(G \mid C_1) \bigr] \tag{2}$$
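To make Equations (1) and (2) concrete, the sketch below evaluates conditional reliability exactly by enumerating all 2^m possible worlds of a toy graph. It is meant only as an illustration: the function names, the edge-dictionary layout, and the toy graph are assumptions, and the enumeration is exponential in the number of edges.

```python
from itertools import product

def reachable(edges, s, t):
    """Depth-first search over a list of directed (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    stack, seen = [s], {s}
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def exact_reliability(edge_probs, s, t):
    """Sum P(G | C1) over all possible worlds G where t is reachable from s.

    edge_probs: dict mapping a directed edge (u, v) -> P(e | C1),
                i.e., probabilities already combined over the chosen catalysts.
    """
    edges = list(edge_probs)
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        world_prob, present = 1.0, []
        for e, keep in zip(edges, mask):
            p = edge_probs[e]
            world_prob *= p if keep else (1.0 - p)   # Equation (1)
            if keep:
                present.append(e)
        if world_prob > 0.0 and reachable(present, s, t):
            total += world_prob                       # Equation (2)
    return total

# Hypothetical toy graph: two disjoint two-edge paths from s to t.
toy = {("s", "a"): 0.5, ("a", "t"): 0.6, ("s", "b"): 0.5, ("b", "t"): 0.5}
r = exact_reliability(toy, "s", "t")   # equals 1 - (1 - 0.3) * (1 - 0.25)
```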

The problem that we tackle in this work is introduced next.

Problem 1 (s-t TOP-k CATALYSTS). Given an uncertain graph 𝒢 = (V, E, C, P), a source node s ∈ V, a target node t ∈ V, and a positive integer k, find a set C∗ ⊆ C of catalysts, having size k, that maximizes the conditional reliability R((s, t)|C∗) from s to t:

$$C^* = \operatorname*{arg\,max}_{C_1 \subseteq C} R((s, t) \mid C_1) \quad \text{subject to } |C_1| = k. \tag{3}$$

Intuitively, the top-k set C∗ yields multiple high-probability paths from the source node s to the target node t. Any specific path can have edges formed due to different catalysts.

Theoretical characterization. Problem 1 intrinsically relies on the classical reliability problem (see footnote 1), which is #P-complete [5]. As a result, Problem 1 is hard as well.

However, like standard reliability, conditional reliability can be estimated in polynomial time via Monte Carlo sampling, or other sampling methods [23]. Thus, the key question is whether Problem 1 remains hard even if polynomial-time conditional-reliability estimation is employed. As formalized next, the answer to this question is positive.

Theorem 1. Problem 1 is NP-hard even assuming polynomial-time computation for conditional reliability.

Proof. We prove NP-hardness by a reduction from the MAX k-COVER problem. In MAX k-COVER, we are given a universe U and a set of h subsets of U, i.e., S = {S1, S2, . . . , Sh}, where Si ⊆ U, for all i ∈ [1 . . . h]. The goal is to find a subset S∗ of S, of size |S∗| = k, such that the number of elements covered by S∗ is maximized, i.e., so as to maximize |∪_{S∈S∗} S|. Given an instance of MAX k-COVER, we construct in polynomial time an instance of the s-t TOP-k CATALYSTS problem as follows.

We create an uncertain graph 𝒢 with a source node s and a target node t. We add to 𝒢 a set of nodes u1, u2, . . . , uZ, one for each element in U (Z = |U|). We connect each of these nodes ui to the target node t with a (directed) edge (ui, t), and assume that each of such edges (ui, t) can occur only in the presence of a single catalyst c with a certain probability p < 1, i.e., ∀i ∈ [1..Z] : P((ui, t)|c) = p and P((ui, t)|c′) = 0, ∀c′ ≠ c. Similarly, we put in 𝒢 another set of nodes x1, x2, . . . , xZ (again one for each element in U), and connect each of these nodes xi to the source node s with an edge (s, xi). Each of such edges (s, xi) can also be present only in the presence of catalyst c, with probability P((s, xi)|c) = p. Finally, if some element ui ∈ U is covered by at least one of the subsets in S, we add a directed edge (xi, ui) in 𝒢. For each set Sj ∈ S that covers item ui, we consider a corresponding catalyst cj and set the probability P((xi, ui)|cj) = 1.

Footnote 1: Given an uncertain graph, a source node s, and a target node t, compute the probability that t is reachable from s.

Now, we ask for a solution of s-t TOP-k CATALYSTS on the uncertain graph 𝒢 constructed by using k + 1 catalysts. Every solution to our problem necessarily takes catalyst c, as otherwise there would be no way to connect s to t. Moreover, given that the paths connecting s to t are all disjoint, and each of them exists with probability < 1 (as p < 1), the reliability from s to t is maximized by selecting k additional catalysts that make the maximum number of paths exist, or, equivalently, selecting k other catalysts that make each of the edges (xi, ui) exist with probability 1. In order for each edge (xi, ui) to exist with probability 1, it suffices to have selected only one of the catalysts that are assigned to (xi, ui). Thus, selecting k catalysts that maximize the number of edges (xi, ui) existing with probability 1 corresponds to selecting k subsets Sj that maximize the number of elements covered. Hence, the theorem.
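The construction used in the proof can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the node and catalyst naming scheme ("c" for the shared catalyst, "c0", "c1", ... for the subsets), and the choice p = 0.5 are hypothetical. The closed-form reliability relies on the observation in the proof that the s-t paths are disjoint, so each covered element contributes an independent path of probability p · 1 · p.

```python
def build_instance(universe, subsets, p=0.5):
    """Return the edges (u, v, catalyst, prob) of the reduction graph.

    Catalyst "c" enables the edges (s, x_i) and (u_i, t);
    catalyst "c<j>" corresponds to subset S_j and enables (x_i, u_i) for i in S_j.
    """
    edges = []
    for i in sorted(universe):
        edges.append(("s", f"x{i}", "c", p))
        edges.append((f"u{i}", "t", "c", p))
        for j, subset in enumerate(subsets):
            if i in subset:
                edges.append((f"x{i}", f"u{i}", f"c{j}", 1.0))
    return edges

def reliability_with_cover(chosen, subsets, p=0.5):
    """Reliability when catalyst "c" and the catalysts of the chosen subsets are selected.

    Every covered element yields one disjoint s-t path with probability p * p,
    hence R = 1 - (1 - p^2)^(number of covered elements).
    """
    covered = set().union(*(subsets[j] for j in chosen))
    return 1.0 - (1.0 - p * p) ** len(covered)

# Hypothetical MAX k-COVER instance with k = 2.
universe = {1, 2, 3, 4}
subsets = [{1, 2}, {2, 3}, {3, 4}]
edges = build_instance(universe, subsets)
best = reliability_with_cover({0, 2}, subsets)    # covers all four elements
worse = reliability_with_cover({0, 1}, subsets)   # covers three elements
```

Because the reliability is strictly increasing in the number of covered elements, maximizing it over k subset-catalysts is the same as maximizing coverage.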

Apart from being NP-hard, Problem 1 is also not easy to approximate, as it does not admit any Polynomial Time Approximation Scheme (PTAS).

Theorem 2. Problem 1 does not admit any PTAS, unless P = NP.

Proof. See Appendix.

As further evidence of the difficulty of our problem, it turns out that neither submodularity nor supermodularity holds for the objective function therein. Thus, standard greedy hill-climbing algorithms do not directly come with approximation guarantees. Non-supermodularity easily follows from NP-hardness (as maximizing supermodular set functions under a cardinality constraint is solvable in polynomial time), while non-submodularity is shown next with a counter-example.

Fact 1. The objective function of Problem 1 is not submodular.

A set function f is submodular if f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B), for all sets A ⊆ B and all elements x ∉ B. Consider the example in Figure 2. Let C1 = {c2}, C2 = {c1, c2}. We find that R((s, t)|C1) = 0, R((s, t)|C1 ∪ {c3}) = 0, R((s, t)|C2) = 0.3, and R((s, t)|C2 ∪ {c3}) = 0.475. Hence, adding c3 yields a marginal gain of 0 with respect to C1, but a gain of 0.175 with respect to its superset C2, so submodularity does not hold in this example.
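The counter-example can be checked numerically with the short sketch below. The edge-catalyst probabilities are those given in the caption of Figure 2. The topology, however, is an assumption: the drawing is not available here, so the code assumes two edge-disjoint s-t paths, e1 followed by e2 and e3 followed by e4, which reproduces the reliability values quoted in the text.

```python
from math import prod

# Conditional probabilities from the caption of Figure 2; all other pairs are 0.
PROB = {("e1", "c1"): 0.5, ("e2", "c2"): 0.6, ("e3", "c3"): 0.5, ("e4", "c1"): 0.5}

# ASSUMPTION: two edge-disjoint s-t paths (inferred topology, not taken from the figure).
PATHS = [("e1", "e2"), ("e3", "e4")]

def edge_prob(edge, catalysts):
    """P(e | C1) under the independent coin-flip model."""
    return 1.0 - prod(1.0 - PROB.get((edge, c), 0.0) for c in catalysts)

def reliability(catalysts):
    """For edge-disjoint paths: 1 - prod over paths of (1 - path probability)."""
    return 1.0 - prod(
        1.0 - prod(edge_prob(e, catalysts) for e in path) for path in PATHS
    )

A, B, x = {"c2"}, {"c1", "c2"}, "c3"
gain_A = reliability(A | {x}) - reliability(A)   # marginal gain w.r.t. the smaller set
gain_B = reliability(B | {x}) - reliability(B)   # marginal gain w.r.t. the superset
violates_submodularity = gain_A < gain_B
```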

3 SINGLE-SOURCE SINGLE-TARGET: BASELINES

In this section, we present two simple baseline approaches and discuss their limitations (Sections 3.1 and 3.2). Then, in Section 4, we propose a more sophisticated algorithm that aims at overcoming the weaknesses of such baselines.

3.1 Individual top-k baseline

The most immediate approach to our s-t TOP-k CATALYSTS problem consists of estimating the reliability R((s, t)|{c}) between the source s and the target t attained by each catalyst c ∈ C individually, and then outputting the top-k catalysts that achieve the highest individual reliability.

Time complexity. For each catalyst, we can estimate reliability via Monte Carlo (MC) sampling (see footnote 2): sample a set of K deterministic graphs from the input uncertain graph, and estimate reliability by summing the (normalized) probabilities of the graphs where the target is reachable from the source. The time complexity of MC sampling for a single catalyst is O(K(n + m)), where n and m denote the number of nodes and edges in the input uncertain graph, respectively. Hence, the overall time complexity of the Individual top-k baseline is O(|C|K(n + m) + |C| log k), where the last term is due to top-k search.
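A minimal sketch of such an MC estimator is shown below. It assumes that worlds are sampled according to their probability, so that the estimate is simply the fraction of samples in which t is reached; the function name, its parameters, and the toy graph are hypothetical. Edges are sampled lazily during the search, which keeps the cost of one sample within O(n + m).

```python
import random

def mc_reliability(edge_probs, s, t, num_samples=1000, seed=0):
    """Estimate R((s, t) | C1) as the fraction of sampled worlds where t is reachable.

    edge_probs: dict (u, v) -> P(e | C1) for the catalyst set under evaluation.
    """
    if s == t:
        return 1.0
    rng = random.Random(seed)
    adj = {}
    for (u, v), p in edge_probs.items():
        adj.setdefault(u, []).append((v, p))
    hits = 0
    for _ in range(num_samples):
        stack, seen, found = [s], {s}, False
        while stack and not found:
            u = stack.pop()
            for v, p in adj.get(u, []):
                # Flip the coin of edge (u, v) only if v is still unreached.
                if v in seen or rng.random() >= p:
                    continue
                if v == t:
                    found = True
                    break
                seen.add(v)
                stack.append(v)
        hits += found
    return hits / num_samples

# Hypothetical toy graph whose exact reliability is 1 - 0.7 * 0.75 = 0.475.
toy = {("s", "a"): 0.5, ("a", "t"): 0.6, ("s", "b"): 0.5, ("b", "t"): 0.5}
estimate = mc_reliability(toy, "s", "t", num_samples=20000, seed=1)
```

The Individual top-k baseline would call such an estimator once per catalyst, each time with the edge probabilities induced by that single catalyst.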

Shortcomings. The Individual top-k algorithm suffers fromboth accuracy and efficiency issues.

• Accuracy: This baseline is unable to capture the contribution of paths containing different catalysts. For example, in Figure 2, the individual reliability attained by each catalyst is 0. Thus, if we are to select the top-2 catalysts, there will be no way to discriminate among catalysts, which will be picked at random. Instead, in reality, the top-2 set is {c1, c2}.

• Efficiency: To achieve good accuracy, MC sampling typically requires thousands of samples [23]. Performing such a sampling for each of the |C| catalysts can be quite expensive on large graphs (|C| may be up to the order of thousands as well, see Section 6).

3.2 Greedy baseline

A more advanced baseline consists of greedily selecting the catalyst that brings the maximum marginal gain to the total reliability, until k catalysts have been selected. More precisely, assuming that a set C1 of catalysts has been already computed, in the next iteration this Greedy baseline selects a catalyst c∗ such that:

$$c^* = \operatorname*{arg\,max}_{c \in C \setminus C_1} \bigl[ R((s, t) \mid C_1 \cup \{c\}) - R((s, t) \mid C_1) \bigr]$$

Note that, since the objective function of the s-t TOP-k CATALYSTS problem is neither submodular nor supermodular, this greedy approach does not come with any approximation guarantee.
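The Greedy baseline can be sketched as follows, with the reliability estimator abstracted as a callable. The function name, the oracle interface, and the tie-breaking rule (first catalyst in iteration order) are assumptions of this sketch. In the toy oracle, the value 0.3 for {c1, c2} is the one reported for Figure 2, while 0.25 for {c1, c3} is inferred from an assumed topology and is labeled as such.

```python
def greedy_catalysts(catalysts, k, reliability):
    """Repeatedly add the catalyst with the largest marginal gain in reliability.

    catalysts: list of candidate catalysts (iteration order breaks ties).
    reliability: callable mapping a frozenset of catalysts to R((s, t) | C1);
                 in practice this would be an MC estimator.
    """
    chosen = set()
    for _ in range(min(k, len(catalysts))):
        base = reliability(frozenset(chosen))
        best, best_gain = None, float("-inf")
        for c in catalysts:
            if c in chosen:
                continue
            gain = reliability(frozenset(chosen | {c})) - base
            if gain > best_gain:
                best, best_gain = c, gain
        chosen.add(best)
    return chosen

# Toy oracle for the Figure 2 example; unspecified sets have reliability 0.
# ASSUMPTION: R({c1, c3}) = 0.25 is inferred, not stated in the text.
TABLE = {frozenset({"c1", "c2"}): 0.3, frozenset({"c1", "c3"}): 0.25}
oracle = lambda cats: TABLE.get(cats, 0.0)

# If the uninformed first pick happens to be c3, Greedy ends up with {c1, c3}.
result = greedy_catalysts(["c3", "c1", "c2"], 2, oracle)
```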

Time complexity. The time complexity of each iteration of the Greedy baseline is O(|C|K(n + m)), as we need to estimate the reliability achieved by the addition of each catalyst in order to choose the one maximizing the marginal gain. For a total of k iterations (top-k catalysts are to be reported), the overall complexity is O(|C|kK(n + m)).

Shortcomings. While being more sophisticated than Individual top-k, the Greedy baseline still suffers from both accuracy and efficiency issues.

• Accuracy: Although Greedy partially solves the accuracy issue related to the presence of paths with multiple catalysts, such an issue is still present at least in the initial phases of this second baseline.

[Figure 3: graph with source s, target t, and edges e1, ..., e6, accompanied by a table of edge-catalyst probabilities.]

Fig. 3: Difficulties with the Greedy baseline. P(e|c) = 0 for all edge-catalyst combinations that are not present in the table.

For example, in Figure 2 the individual reliability attained by each catalyst is 0. Therefore, in the first iteration the Greedy algorithm has no information to properly select a catalyst, thus ending up with a completely random choice. If c3 is selected as the first catalyst, then the second catalyst selected would be c1. Thus, Greedy would output {c1, c3}, while the top-2 set is {c1, c2}. We refer to this issue as the “cold-start” problem.

• Efficiency: MC sampling is performed |C|k times. This is less efficient than the Individual top-k baseline.

Example 1. We demonstrate the cold-start problem associated with the Greedy baseline with a running example in Figure 3. Assume k = 3. The individual reliability from s to t attained by each of the four catalysts is 0. Therefore, in the first iteration, the Greedy algorithm selects a catalyst uniformly at random, say c4. Then, the second catalyst selected would be c1, since c1, in the presence of c4, provides the maximum marginal gain compared to any other catalyst. Similarly, in the third round, Greedy will select c2 due to its higher marginal gain. Therefore, the total reliability achieved by Greedy is

$$R((s, t) \mid \{c_4, c_1, c_2\}) = 1 - (1 - 0.5 \times 0.5)(1 - 0.8 \times 0.7 \times 0.7) = 0.544.$$

However, the top-3 set is {c1, c2, c3}, yielding reliability

$$R((s, t) \mid \{c_1, c_2, c_3\}) = 0.8\,[1 - (1 - 0.8)(1 - 0.7 \times 0.7)] = 0.7184.$$

This shows the sub-optimality of the Greedy baseline.
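The two values in Example 1 can be reproduced with plain arithmetic. The sketch below only re-evaluates the two expressions given in the example; the variable names are hypothetical, and the reading of the expressions in the comments (two independent paths versus a shared first edge followed by two alternative continuations) is an interpretation of their algebraic form, since the drawing of Figure 3 is not available here.

```python
# Greedy set {c4, c1, c2}: two independent paths with probabilities
# 0.5 * 0.5 and 0.8 * 0.7 * 0.7; at least one of them must exist.
greedy_reliability = 1 - (1 - 0.5 * 0.5) * (1 - 0.8 * 0.7 * 0.7)

# Top-3 set {c1, c2, c3}: a first edge with probability 0.8, followed by
# two alternative continuations with probabilities 0.8 and 0.7 * 0.7.
optimal_reliability = 0.8 * (1 - (1 - 0.8) * (1 - 0.7 * 0.7))

gap = optimal_reliability - greedy_reliability
```

The gap of roughly 0.17 quantifies how much the uninformed first choice costs Greedy in this example.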

4 SINGLE-SOURCE SINGLE-TARGET: PROPOSED METHOD

Here we describe the method we ultimately propose to provide effective and efficient solutions to the s-t TOP-k CATALYSTS problem.

The main intuition behind our method directly follows from the shortcomings of the two baselines discussed above. Particularly, both baselines highlight how considering catalysts one at a time is less effective. This can easily be explained as a single catalyst can bring information that is related only to single edges. Instead, what really matters in computing the reliability between two nodes is the set of paths connecting the source and the target. This observation finds confirmation in the literature [12].

Motivated by this, we design the proposed method as composed of two main steps. First, we select the top-r paths exhibiting highest reliability from the source to the target. Second, we iteratively include these paths in the solution so as to maximize the marginal gain in reliability, while still keeping the constraint on total number of catalysts satisfied. Apart from the main advantage due to considering paths instead of individual catalysts, designing our algorithm as composed of two separate steps allows us to achieve high efficiency. Indeed, the first step can be efficiently solved by fast algorithms for finding the top-r shortest paths, while the second step requires MC sampling to be performed in a significantly reduced version of the original graph.

Footnote 2: In this paper, we employ MC sampling as an oracle to estimate reliability in uncertain graphs. While more advanced sampling techniques exist, e.g., RHT [23], recursive stratified sampling [26], our contributions are orthogonal to them. We omit discussing advanced sampling methods for brevity.

Algorithm 1 Rel-Path
Require: Uncertain graph 𝒢 = (V, E, C, P), source node s ∈ V, target node t ∈ V, positive integers k, r
Ensure: Subset of catalysts C∗ ⊆ C
1: 𝒫 ← Algorithm 2 on input (𝒢, s, t, r)
2: 𝒫1 ← Algorithm 3 on input (𝒢, s, t, k), 𝒫
3: C∗ ← catalysts present on 𝒫1

Algorithm 2 Top-r Most Reliable Paths Selection
Require: Uncertain graph 𝒢 = (V, E, C, P), source node s ∈ V, target node t ∈ V, positive integer r
Ensure: 𝒫: top-r most reliable paths from s to t
1: for all e ∈ E do
2:   let C(e) = {c1, c2, . . . , ci} be the set of all catalysts s.t. P(e|cj) > 0, ∀j ∈ [1..i]
3:   replace e by i edges {e1, e2, . . . , ei}
4:   assign probability P(ej|cj) = P(e|cj)
5:   assign edge-weight W(ej) = − log P(ej|cj)
6: end for
7: 𝒫 ← top-r shortest paths from s to t in the constructed multigraph

Algorithm 3 Iterative Path Inclusion
Require: Top-r most-reliable path set 𝒫 from source s to target t, positive integer k
Ensure: A subset of paths 𝒫1 ⊆ 𝒫
1: 𝒫1 ← ∅
2: while |𝒫| > 0 and total #catalysts in 𝒫1 is ≤ k do
3:   P∗ ← argmax_{P ∈ 𝒫 \ 𝒫1} Rel_{𝒫1 ∪ {P}}(s, t) s.t. #catalysts in 𝒫1 ∪ {P∗} is ≤ k
4:   𝒫1 ← 𝒫1 ∪ {P∗}
5:   𝒫 ← 𝒫 \ {P∗}
6: end while

The outline of the proposed method, which we call Rel-Path, is reported in Algorithm 1. In the following we provide the details of each of the two steps.

4.1 Step 1: Most-reliable paths selection

The first step of the proposed method consists of finding the top-r most reliable paths from the source to the target. Given an uncertain graph 𝒢 = (V, E, C, P), a source node s ∈ V, and a target node t ∈ V, we first convert 𝒢 into an uncertain multigraph 𝒢′ (Algorithm 2). For each edge e = (u, v) ∈ E, let C(e) ⊆ C denote the set of all single catalysts such that ∀c ∈ C(e) : P(e|c) > 0. Assume C(e) = {c1, c2, . . . , ci}. Then, we add i edges {e1, e2, . . . , ei} between u and v in the multigraph 𝒢′. To each newly constructed edge ej, j ∈ [1..i], we assign a single catalyst C(ej) = cj and set P(ej|cj) = P(e|cj). It can be easily noted that 𝒢 and 𝒢′ are equivalent in terms of our problem. The construction of 𝒢′ only serves the purpose of selecting the top-r most reliable paths from s to t in such a way that, for each intermediate pair of nodes x, y along a path, a single edge (and, thus, a single catalyst) among the many ones possibly created by the 𝒢 → 𝒢′ transformation is picked. The reliability of a path is defined as the product of the edge probabilities along that path.

To ultimately compute the top-r most reliable paths, we further convert the uncertain multigraph 𝒢′ into an edge-weighted multigraph 𝒢′′ by assigning a weight − log(p_e) to each edge e with probability p_e of 𝒢′. This way, the top-r most reliable paths in 𝒢′ will correspond to the top-r shortest paths in 𝒢′′. To compute the top-r shortest paths in 𝒢′′, we apply the well-established Eppstein’s algorithm [16], which has time complexity O(|C|m + n log n + r).
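The effect of the − log transformation can be illustrated with the sketch below, which enumerates simple s-t paths of the multigraph in order of increasing total weight, i.e., decreasing reliability. Note that this is not Eppstein’s algorithm: it is a plain best-first enumeration over path prefixes, exponential in the worst case, swapped in to keep the example short. The function name, the edge-tuple layout, and the toy multigraph are assumptions.

```python
import heapq
from math import exp, log

def top_r_reliable_paths(edges, s, t, r):
    """Return up to r simple s-t paths as (reliability, [(u, v, catalyst), ...]).

    edges: list of (u, v, catalyst, prob) with prob in (0, 1]; parallel edges
           between the same nodes (one per catalyst) are allowed.
    Paths are returned most reliable first.
    """
    adj = {}
    for u, v, c, p in edges:
        adj.setdefault(u, []).append((v, c, -log(p)))   # weight = -log P(e | c)
    # Heap entries: (cost so far, tie-breaker, current node, visited nodes, path).
    heap, counter, found = [(0.0, 0, s, frozenset([s]), ())], 1, []
    while heap and len(found) < r:
        cost, _, u, visited, path = heapq.heappop(heap)
        if u == t:
            found.append((exp(-cost), list(path)))      # back to a probability
            continue
        for v, c, w in adj.get(u, []):
            if v not in visited:
                entry = (cost + w, counter, v, visited | {v}, path + ((u, v, c),))
                heapq.heappush(heap, entry)
                counter += 1
    return found

# Hypothetical multigraph: the edge (s, a) exists under catalyst c1 or c3.
multigraph = [
    ("s", "a", "c1", 0.5), ("s", "a", "c3", 0.9), ("a", "t", "c2", 0.6),
    ("s", "b", "c3", 0.5), ("b", "t", "c1", 0.5),
]
paths = top_r_reliable_paths(multigraph, "s", "t", 2)
```

Since all weights are non-negative, the first time a prefix ending at t is popped from the heap it is the cheapest remaining path, so paths come out in order of decreasing reliability.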

Space complexity. We note that both 𝒢′ and 𝒢′′ can have size at most |C| times that of the original graph 𝒢 = (V, E, C, P). This is because in Algorithm 2, each edge e of 𝒢 is replaced by |C(e)| edges in 𝒢′ and 𝒢′′ (lines 2-3), where C(e) denotes the set of all catalysts c such that P(e|c) > 0. Clearly, |C(e)| ≤ |C|. Therefore, both 𝒢′ and 𝒢′′ can have at most |E||C| = m|C| edges, while still having the same number of nodes as in the original graph. In summary, the size of 𝒢′ and 𝒢′′ is linear in that of the original graph and in the number of catalysts. Based on our experimental results, this adds a very small overhead to the overall space requirement (see Section 6).

Choice of r. The number r of paths is an input parameter which constitutes a knob to trade off between efficiency and accuracy (a larger r leads to higher accuracy and lower efficiency). In general, its choice depends on the input graph and the number of top-k catalysts. An easy yet effective way to set it is to observe when the inclusion of the top-(r + 1)-th reliable path does not tangibly increase the reliability given by the top-r paths. We provide experimental results on selecting r in Section 6.

4.2 Step 2: Iterative path inclusion

The second step of our Rel-Path method aims at selecting a proper subset from the top-r most-reliable path set so as to maximize reliability between source and target nodes, while also meeting the constraint on the number of output catalysts. Denoting by Rel_𝒫(s, t) the reliability between s and t in the subgraph induced by a path set 𝒫, this step formally corresponds to the following problem:

Problem 2 (ITERATIVE PATH INCLUSION). Given a set 𝒫 of top-r most reliable paths from s to t in multigraph 𝒢′, find a path set 𝒫∗ ⊆ 𝒫 such that:

$$\mathcal{P}^* = \operatorname*{arg\,max}_{\mathcal{P}_1 \subseteq \mathcal{P}} \mathrm{Rel}_{\mathcal{P}_1}(s, t) \quad \text{subject to } \Bigl| \bigcup_{e \in \mathcal{P}_1} C(e) \Bigr| \le k \tag{4}$$

The ITERATIVE PATH INCLUSION problem can be shown to be NP-hard via a reduction from MAX k-COVER. The proof is analogous to that of Theorem 1, so we omit it.

Theorem 3. Problem 2 is NP-hard.

Algorithm. We design an efficient greedy algorithm (Algorithm 3) for the ITERATIVE PATH INCLUSION problem. At each iteration, we add to the already computed path set 𝒫1 the path P∗ that brings the maximum marginal gain in terms of reliability. While selecting path P∗, we also ensure that the total number of catalysts used in the paths 𝒫1 ∪ {P∗} is no more than k. The algorithm terminates either when there is no path left in the top-r most reliable path set 𝒫, or when no more paths can be added without violating the budget k on the number of catalysts. We report the catalysts present in 𝒫1 as our final solution. If the total number of catalysts present in 𝒫1 is k′ < k, additional k − k′ catalysts that are not in 𝒫1 can be selected with some proper criterion (e.g., frequency on the non-selected paths).

[Figure 4, three panels: (a) 3 paths from s to t; (b) 1st iteration; (c) 2nd iteration.]

Fig. 4: A demonstration of the Iterative Path Inclusion algorithm.

Next, we demonstrate our Iterative Path Inclusion algorithm with the previous running example.

Example 2. In Figure 4(a), which is the same as Figure 3, we have 3 paths from s to t, i.e., P1 : e1e2, P2 : e3e4, and P3 : e3e5e6. Assume that there is a budget constraint of 3 catalysts. In the first iteration, we select the path P2 since it has the highest reliability compared to the two other paths. In the second iteration, P1 and P2 together have higher reliability than P2 and P3. However, the former combination requires 4 catalysts, thus violating the constraint. Hence, we select P2 and P3. After that, the algorithm terminates as no more paths can be included without violating the constraint on catalysts.
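The greedy selection described above can be sketched as follows. The sketch makes several assumptions that are not part of the paper: paths are represented as dictionaries with a precomputed reliability and catalyst set, the reliability of a path set is supplied as a callable, and the default callable uses the closed form for node-disjoint paths instead of MC sampling. The three paths in the usage example are hypothetical and are not those of Figure 4, where two paths share an edge.

```python
from math import prod

def disjoint_reliability(paths):
    """Reliability of node-disjoint s-t paths: 1 - prod(1 - Rel(P))."""
    return 1.0 - prod(1.0 - p["rel"] for p in paths)

def iterative_path_inclusion(paths, k, rel=disjoint_reliability):
    """Greedily add the path that maximizes reliability within a budget of k catalysts.

    paths: list of dicts {"rel": float, "catalysts": set}.
    rel:   callable estimating the reliability of a list of paths
           (ASSUMPTION: node-disjoint closed form by default).
    Returns (selected paths, set of catalysts used).
    """
    remaining, selected, used = list(paths), [], set()
    while remaining:
        feasible = [p for p in remaining if len(used | p["catalysts"]) <= k]
        if not feasible:
            break   # no remaining path fits in the catalyst budget
        best = max(feasible, key=lambda p: rel(selected + [p]))
        selected.append(best)
        used |= best["catalysts"]
        remaining.remove(best)
    return selected, used

# Hypothetical node-disjoint paths with a budget of k = 3 catalysts.
candidates = [
    {"rel": 0.64, "catalysts": {"c2", "c3"}},
    {"rel": 0.25, "catalysts": {"c1", "c4"}},
    {"rel": 0.30, "catalysts": {"c1", "c2"}},
]
chosen_paths, chosen_catalysts = iterative_path_inclusion(candidates, 3)
```

As in Example 2, the second-best combination is chosen here because the alternative would need four catalysts; the loop stops as soon as no remaining path fits in the budget.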

Approximation guarantee. The Iterative Path Inclusion algorithm achieves an approximation guarantee under some assumptions. If the top-r most reliable paths are node-disjoint (except at source and target nodes), Iterative Path Inclusion exhibits an approximation ratio of at least 1/r.

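Theorem 4. The Iterative Path Inclusion algorithm, under the assumption that the top-r most reliable paths are node-disjoint, achieves an approximation factor of:

$$\frac{1}{k_{Rel}} \left( 1 - \left( \frac{K_C - k_{Rel}}{K_C} \right)^{k_C} \right), \tag{5}$$

where

$$K_C = \max_{\mathcal{P}_1 \subseteq \mathcal{P}} \bigl\{ |\mathcal{P}_1| : |C(\mathcal{P}_1)| \le k \bigr\} \tag{6}$$

$$k_C = \min_{\mathcal{P}_1 \subseteq \mathcal{P}} \bigl\{ |\mathcal{P}_1| : |C(\mathcal{P}_1)| \le k \text{ and } \forall P \in \mathcal{P} \setminus \mathcal{P}_1,\ |C(\mathcal{P}_1 \cup \{P\})| > k \bigr\} \tag{7}$$

$$k_{Rel} = 1 - \min_{P \in \mathcal{P}} \frac{\mathrm{Rel}_{\mathcal{P}}(s, t) - \mathrm{Rel}_{\mathcal{P} \setminus \{P\}}(s, t)}{\mathrm{Rel}_{\{P\}}(s, t)}. \tag{8}$$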

Proof. See Appendix.

In the above approximation-guarantee result, K_C and k_C denote, respectively, the maximum and minimum size of a maximal feasible path set that can be derived from P. k_Rel denotes the curvature of our optimization function, which can be shown to be submodular when the paths in P are node-disjoint (see Appendix). Hence, in this case k_Rel ∈ (0, 1). Assuming that P contains at least one path having fewer than k catalysts, in the worst case the approximation ratio is ≥ 1/K_C ≥ 1/r (where r is the total number of paths in the top-r path set P). In other words, the approximation ratio is guaranteed to be at least 1/r.
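The step from the factor in Equation (5) to the 1/r bound can be spelled out as follows. This is a short sketch of the reasoning rather than part of the original proof; it only uses k_Rel ∈ (0, 1], 1 ≤ k_C ≤ K_C, and the fact that a feasible path set cannot contain more than the r paths of P.

```latex
% Since 0 \le 1 - k_{Rel}/K_C < 1 and k_C \ge 1:
\left( \frac{K_C - k_{Rel}}{K_C} \right)^{k_C}
  = \left( 1 - \frac{k_{Rel}}{K_C} \right)^{k_C}
  \le 1 - \frac{k_{Rel}}{K_C}.
% Substituting into the approximation factor (5):
\frac{1}{k_{Rel}} \left( 1 - \left( \frac{K_C - k_{Rel}}{K_C} \right)^{k_C} \right)
  \ge \frac{1}{k_{Rel}} \cdot \frac{k_{Rel}}{K_C}
  = \frac{1}{K_C}
  \ge \frac{1}{r},
% where the last inequality uses K_C \le |\mathcal{P}| = r.
```

The bound is tight for k_C = 1, where the factor equals exactly 1/K_C regardless of the curvature.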

Time complexity. Let us denote by n′ and m′ the number of nodes and edges, respectively, in the subgraph induced by the top-r most reliable path set P. At each iteration, our Iterative Path Inclusion algorithm performs MC sampling over the subgraph induced by the selected paths, once for each of the at most r candidate paths. The number of iterations is at most r. Thus, if K is the number of samples used in each MC sampling, the Iterative Path Inclusion algorithm takes O(r²K(n′ + m′)) time. Including the time due to the first step of selecting the top-r most reliable paths, we get that the overall time complexity of the proposed Rel-Path algorithm is O(|C|m + n log n + r²K(n′ + m′)). We point out that the subgraph induced by the top-r most reliable paths is typically much smaller than the input graph G. Thus, our Rel-Path method is expected to be much more efficient than the two baselines introduced earlier. Experiments in Section 6 confirm this claim.
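The O(K(n′ + m′)) cost of a single reliability estimate comes from running K sampled graph traversals. A minimal Python sketch of such an estimator is given below; the function name, the adjacency-list input format, and the choice to flip each edge's coin lazily during a breadth-first search are assumptions made for illustration, and the probabilities passed in are taken to be already conditioned on the chosen catalysts.

```python
import random
from collections import deque

def mc_reliability(adj, source, target, num_samples=1000, seed=0):
    """Estimate the probability that target is reachable from source.

    adj maps a node to a list of (neighbor, edge_probability) pairs.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        visited = {source}
        queue = deque([source])
        reached = source == target
        while queue and not reached:
            u = queue.popleft()
            for v, p in adj.get(u, []):
                # The coin for edge (u, v) is flipped only when the search
                # gets here; each node is dequeued once, so every edge is
                # sampled at most once per traversal.
                if v not in visited and rng.random() < p:
                    if v == target:
                        reached = True
                        break
                    visited.add(v)
                    queue.append(v)
        hits += reached
    return hits / num_samples

# Hypothetical two-hop graph: s -> a -> t with certain edges.
certain = {"s": [("a", 1.0)], "a": [("t", 1.0)]}
broken = {"s": [("a", 1.0)], "a": [("t", 0.0)]}
single = {"s": [("t", 0.5)]}
```

Each sample touches every node and edge of the induced subgraph at most once, which is where the (n′ + m′) factor in the bound comes from.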

5 MULTIPLE SOURCES AND TARGETS

Real-world queries often involve sets of source and/or target nodes, instead of a single source-target pair. As an example, the topic-aware information cascade problem [4] asks for a set of early adopters who maximally influence a given set of target customers. Motivated by this, in the following we discuss problems and algorithms for the case where multiple nodes can be provided as input. Such a generalization opens the stage to various formulations of the problem. Here we focus on two variants: (1) maximizing an aggregate function over all possible source-target pairs (Section 5.1), and (2) maximizing connectivity among all query nodes (Section 5.2). Note that our first problem formulation has a notion of “clique” connectivity, as it applies an aggregate function over all pairs of query nodes.

5.1 Maximizing aggregate functions

We formulate our problem as follows.

Problem 3 (top-k catalysts w/ aggregate). Given an uncertain graph G = (V, E, C, P), a source set S ⊂ V, a target set T ⊂ V, and a positive integer k, find a set C∗ of catalysts, having size k, that maximizes an aggregate function F over the conditional reliability of all source-target pairs:

\[
C^* = \arg\max_{C_1 \subseteq C} \; F_{\langle s,t \rangle \in S \times T} \big( R((s,t) \mid C_1) \big) \quad \text{subject to } |C_1| = k. \tag{9}
\]

Being a generalization of the s-t TOP-k CATALYSTS problem, Problem 3 can easily be recognized as NP-hard. In this work we consider three commonly used aggregate functions: average, maximum, and minimum. These aggregates give rise to three variants of Problem 3, which we refer to as TOP-k CATALYSTS AVG, TOP-k CATALYSTS MAX, and TOP-k CATALYSTS MIN, respectively.

Motivating examples for multiple sources and targets.

• Average. Find the top-k catalysts such that the average reliability over all 〈s, t〉 pairs is maximized. This is equivalent to maximizing the sum of the reliability over all 〈s, t〉 pairs. This problem occurs, e.g., in the topic-aware information cascade scenario when the campaigner wants to maximize the spread of information to the entire target group.

• Maximum. Find the top-k catalysts such that the reliability of the 〈s, t〉 pair with the highest reliability is maximized. In the topic-aware information cascade problem, this is equivalent to the scenario in which each early adopter is campaigning for a different product of the same campaigner. The campaigner wants at least one target user to be aware of one of her products (e.g., each target user might be a celebrity user on Twitter). Therefore, the campaigner would be willing to maximize the spread of information from at least one early adopter to at least one target user.

• Minimum. Find the top-k catalysts such that the reliability of the 〈s, t〉 pair having the lowest reliability is maximized. In the topic-aware information cascade setting, this is equivalent to the problem in which each early adopter is campaigning for a different product of the same campaigner, and the campaigner wants to maximize the minimum spread of her campaign from any of the early adopters to any of her target users. This is motivated, in reality, by the fact that only a small percentage of the users who have heard about a campaign will buy the corresponding product.
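To make the three objectives concrete, the following Python sketch evaluates them for one fixed catalyst set, given the conditional reliability of every source-target pair. The function name and the reliability values are hypothetical and serve only to illustrate how the aggregate F in Problem 3 is applied.

```python
from statistics import mean

# The three aggregate functions F considered in Problem 3.
AGGREGATES = {"avg": mean, "max": max, "min": min}

def aggregate_reliability(pair_reliability, how):
    """Apply the chosen aggregate to the per-pair conditional reliabilities."""
    return AGGREGATES[how](pair_reliability.values())

# Hypothetical reliabilities R((s, t) | C1) for a 2x2 query.
pair_reliability = {
    ("s1", "t1"): 0.25,
    ("s1", "t2"): 0.5,
    ("s2", "t1"): 0.75,
    ("s2", "t2"): 0.5,
}
scores = {how: aggregate_reliability(pair_reliability, how) for how in AGGREGATES}
```

A search procedure for Problem 3 would compare such scores across candidate catalyst sets of size k and keep the set with the highest value.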

Overview of algorithms. In the following, we describe the algorithms that we develop for the aforementioned aggregate functions. In principle, they follow our earlier two main steps: (1) find the top-r paths (Algorithm 2), now between every pair of source and target nodes, and then (2) iteratively include these paths so as to maximize the marginal gain with respect to the objective function, while still keeping the constraint on the total number of catalysts satisfied (Algorithm 3). Intuitively, finding the top-r paths between each pair of source and target nodes is a natural extension of our Rel-Path algorithm, because the aggregate function in Problem 3 is defined over all pairs of source-target nodes. Nevertheless, the exact process varies somewhat according to the aggregate function at hand, as we discuss next. Unless otherwise specified, we assume that S ∩ T = ∅, that is, the source and target sets are non-overlapping. We will discuss case by case how our algorithms can (easily) handle the case when S ∩ T ≠ ∅.

Extending the baselines presented in Sections 3.1–3.2 to multiple query nodes is instead straightforward (regardless of the aggregate function). We thus omit the details.

5.1.1 Algorithm for maximum reliability

Our solution for the TOP-k CATALYSTS MAX problem is the most straightforward, compared to both the TOP-k CATALYSTS AVG and TOP-k CATALYSTS MIN problems. First, for each 〈s, t〉 pair, we identify the top-r most reliable paths. Then, separately for each 〈s, t〉 pair, we also apply the Iterative Path Inclusion algorithm to find the top-k catalysts for that pair. Finally, we select the 〈s, t〉 pair which attains the maximum reliability. We report the corresponding top-k catalysts as the solution to the TOP-k CATALYSTS MAX problem.
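The procedure above amounts to solving each pair independently and keeping the best result. A minimal Python sketch follows; `solve_pair` is a hypothetical callable standing in for the single-pair Rel-Path pipeline (top-r paths followed by Iterative Path Inclusion), assumed to return the chosen catalysts together with the reliability they achieve.

```python
def top_k_catalysts_max(sources, targets, solve_pair, k):
    """Return ((s, t), catalysts, reliability) for the best single pair."""
    best = None
    for s in sources:
        for t in targets:
            # solve_pair is a stand-in for the single-pair solver.
            catalysts, reliability = solve_pair(s, t, k)
            if best is None or reliability > best[2]:
                best = ((s, t), catalysts, reliability)
    return best

# Hypothetical precomputed single-pair results, used as a stub solver.
stub_results = {
    ("s1", "t1"): ({"c1", "c2"}, 0.40),
    ("s1", "t2"): ({"c3"}, 0.65),
    ("s2", "t1"): ({"c2", "c4"}, 0.55),
    ("s2", "t2"): ({"c1"}, 0.10),
}
best = top_k_catalysts_max(
    ["s1", "s2"], ["t1", "t2"], lambda s, t, k: stub_results[(s, t)], k=2
)
```

Because the pairs are handled separately, the |S||T| single-pair runs are independent of one another, which is consistent with the |S||T| factor in the complexity analysis.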

Time complexity. The time required to find the reliable paths for all 〈s, t〉 pairs is $O(|S||T|(m + n \log n + r))$. Similarly, the time complexity of the iterative path inclusion phase is $O(|S||T| r^2 (n' + m') K)$. There is an additional cost of $O(|S||T|)$ to find the 〈s, t〉 pair with the maximum reliability, which is, however, dominated by the time spent in path inclusion.

Overlapping source and target sets. If a node v is both in S and in T, the above solution will return an arbitrary set C1 of catalysts, since R((v, v)|C1) = 1, and it will always be considered as the optimal solution. If this behavior is not intended, we eliminate all such nodes in S ∩ T before applying our algorithm.

5.1.2 Algorithm for average reliability

As earlier, we first identify the top-r most reliable paths for each 〈s, t〉 pair. However, we are now interested in the average reliability considering all source and target nodes, as opposed to that for individual source-target pairs. Thus, we consider all selected |S||T|r paths together, and apply the Iterative Path Inclusion algorithm to add paths so as to maximize the marginal gain in terms of our objective function, while maintaining the budget k on the total number of catalysts. Recall that here our objective function is $\frac{1}{|S||T|} \sum_{\langle s,t \rangle \in S \times T} R((s,t) \mid C_1)$. Finally, the catalysts present in the selected paths are reported as a solution to the TOP-k CATALYSTS AVG problem.

The above steps remain identical, regardless of whether S and T overlap or not.

Time complexity. The time required to find the reliable paths for all 〈s, t〉 pairs is $O(|S||T|(m + n \log n + r))$, as we apply Eppstein’s algorithm |S||T| times. Next, the time complexity of the iterative path inclusion phase is $O((|S||T| r)^2 (n' + m') K)$, where n′ and m′ are the number of nodes and edges of the subgraph induced by the reliable paths, and K is the number of samples used in each MC sampling. Note that the time required for iterative path inclusion in TOP-k CATALYSTS AVG is higher than that for TOP-k CATALYSTS MAX, since we consider all |S||T|r paths together in the former algorithm.

5.1.3 Algorithm for minimum reliability

We start again by finding the top-r most reliable paths for each 〈s, t〉 pair. However, applying the Iterative Path Inclusion algorithm in this case is more subtle. Specifically, if there are many 〈s, t〉 pairs and a limited budget k of catalysts, spending too many catalysts on a single 〈s, t〉 pair might prevent us from finding a path for another pair. As a result, some pairs would have very low conditional reliability, and thus the objective of the TOP-k CATALYSTS MIN problem, i.e., the conditional reliability of the pair exhibiting the minimum value, would be quite poor. To mitigate this issue, we consider an additional step where we find a minimum set of catalysts before applying the Iterative Path Inclusion algorithm. The subsequent steps remain instead identical, regardless of whether S and T overlap or not.

Finding minimum set of catalysts. The objective of this step is to select a minimum set of catalysts which ensures that at least one path exists for every 〈s, t〉 pair. This step corresponds to the following problem.

Problem 4 (MINIMUM SET OF CATALYSTS). Given a source set S, a target set T, a set of paths P, and an uncertain graph G = (V, E, C, P) induced by P, find the smallest set C∗ ⊆ C of catalysts such that the conditional reliability R((s, t)|C∗) for each 〈s, t〉 pair is larger than zero:

\[
C^* = \arg\min_{C_1 \subseteq C} |C_1| \quad \text{subject to } R((s,t) \mid C_1) > 0, \ \forall \langle s,t \rangle \in S \times T. \tag{10}
\]

Fig. 5: A demonstration of the TOP-k CATALYSTS MIN solution: catalysts that influence the edges are shown in the figure. However, the corresponding probabilities are not shown. Also, P(e|c) = 0 for all edge-catalyst combinations not shown.

Theorem 5. Problem 4 is NP-hard.

Proof. NP-hardness can easily be verified by noticing that the SET COVER problem can be reduced to a specific instance of MINIMUM SET OF CATALYSTS where each path in P can be formed using a single catalyst. In this case, in fact, we ask for the minimum number of catalysts required to cover at least one path of every 〈s, t〉 pair, which exactly corresponds to what SET COVER asks for.

Algorithm for MINIMUM SET OF CATALYSTS. We design an algorithm to provide effective solutions to MINIMUM SET OF CATALYSTS, which consists of four steps:

• Step 1. Mark all 〈s, t〉 pairs as disconnected.

• Step 2. Among all disconnected 〈s, t〉 pairs, find a path P that connects one such 〈s, t〉 pair while adding the minimum number of new catalysts to the set of already selected catalysts.

• Step 3. Mark that 〈s, t〉 pair as connected. Add the catalysts in path P to the set of selected catalysts.

• Step 4. If there is at least one disconnected 〈s, t〉 pair, go to Step 2.

We report the set of selected catalysts as our minimum set. If the size of this minimum set is more than k, we perform an additional step. From the selected set C∗, if a subset C′ can be removed, but a path can still be found for all connected 〈s, t〉 pairs with the remaining catalysts in C∗ \ C′, then C′ is removed from C∗. We illustrate our algorithm with an example below.

Example 3. As shown in Figure 5, let us assume that the source set is S = {s1, s2}, and the target set is T = {t1, t2}. The figure illustrates the top-2 most reliable paths for each source-target pair. Assume there is a budget k = 3 on the number of output catalysts. We now apply our algorithm. First, we select the catalyst c3, since this catalyst is sufficient to have an edge for the 〈s1, t1〉 pair. Then, we select the catalyst c4 because {c3, c4} together add an edge for the 〈s2, t1〉 pair. Next, we consider the catalyst c1 in order to have an edge for the pair 〈s1, t2〉. At this point, we have already saturated the budget of 3 catalysts, {c1, c3, c4}, but we are yet to add an edge for the 〈s2, t2〉 pair. Thus, we delete c4 from the selected set of catalysts, because this still allows a path for the three previously connected source-target pairs. Finally, we add catalyst c2 to the set. The final set {c1, c2, c3} of catalysts allows a path for all source-target pairs.
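The four steps and the redundancy removal can be sketched in Python as follows. This is a simplified illustration, not the authors' code: each candidate path is represented only by its set of catalysts, ties are broken by input order, redundant catalysts are removed one at a time after all pairs are connected (rather than when the budget is saturated), and the path data are hypothetical, since the exact labels of Figure 5 are not recoverable here.

```python
def minimum_catalyst_set(paths_by_pair):
    """Greedy heuristic: paths_by_pair maps (s, t) to a list of catalyst sets."""
    selected = set()
    disconnected = list(paths_by_pair)            # Step 1
    while disconnected:                           # Step 4
        # Step 2: path of a disconnected pair needing the fewest new catalysts.
        candidates = [
            (len(cats - selected), pair, cats)
            for pair in disconnected
            for cats in paths_by_pair[pair]
        ]
        if not candidates:
            raise ValueError("some pair has no candidate path")
        _, pair, cats = min(candidates, key=lambda c: c[0])
        selected |= cats                          # Step 3
        disconnected.remove(pair)
    # Additional step: drop a catalyst if every pair keeps at least one path.
    for c in sorted(selected):
        remaining = selected - {c}
        if all(
            any(cats <= remaining for cats in paths)
            for paths in paths_by_pair.values()
        ):
            selected = remaining
    return selected

# Hypothetical top-2 paths per pair, each given by the catalysts it needs.
paths_by_pair = {
    ("s1", "t1"): [{"c3"}, {"c2", "c3"}],
    ("s2", "t1"): [{"c3", "c4"}, {"c2", "c3"}],
    ("s1", "t2"): [{"c1", "c3"}, {"c1", "c4"}],
    ("s2", "t2"): [{"c1", "c2"}, {"c2", "c3"}],
}
result = minimum_catalyst_set(paths_by_pair)
```

On this made-up instance the greedy phase first collects {c1, c2, c3, c4}, and the removal step then discards c4, ending with the same set {c1, c2, c3} as in Example 3.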

Time complexity. In each iteration, we find the path with the smallest number of new catalysts, which requires $O(r|S||T|(n' + m'))$ time. Then, we also remove the redundant catalysts, which requires another $O(kr|S||T|(n' + m'))$ time. Since there can be at most |S||T| iterations, the overall time complexity of our MINIMUM SET OF CATALYSTS finding algorithm is $O(kr|S|^2|T|^2(n' + m'))$.

Intuitively, the minimum set finding step ensures that, given a large enough budget on catalysts, there will be at least one path for all 〈s, t〉 pairs. Therefore, the objective function of the TOP-k CATALYSTS MIN problem will be guaranteed to be larger than zero. If our budget has not been exhausted yet and more catalysts can be added, we next apply the Iterative Path Inclusion algorithm as follows. At each iteration, we find the 〈s, t〉 pair exhibiting the minimum conditional reliability. We then add a path that maintains the budget, while also maximizing the marginal gain in reliability for that 〈s, t〉 pair. The algorithm terminates when no more paths can be added without exceeding the budget k, or when all top-r paths for all 〈s, t〉 pairs have been selected.

5.2 Maximizing connectivity

In the second variant of the s-t TOP-k CATALYSTS problem applied to multiple query nodes, we do not distinguish between source and target nodes. All query nodes are considered as peers: the objective of this CONNECTIVITY TOP-k CATALYSTS problem is to find a set of top-k catalysts which maximizes the probability that all query nodes are connected in the subgraph induced by edges containing those catalysts. Our problem is thus inspired by the well-established problem of k-terminal reliability, whose objective is to compute the probability that a given set of nodes remains connected. An application of the CONNECTIVITY TOP-k CATALYSTS problem is finding a suitable topic list for a thematic scientific event among researchers. The event would be successful not only when the invitees are experts on those topics, but also if they can network with each other, that is, they can find connections (e.g., direct and indirect links formed due to research collaborations) with other invitees based on those topics [34]. We formally define our problem below.

Problem 5 (CONNECTIVITY TOP-k CATALYSTS). Given an uncertain graph G = (V, E, C, P), a set of query nodes Q ⊂ V, and a positive integer k, find a set C∗ of catalysts, with size k, that maximizes the probability of the nodes in Q being connected using only catalysts in C∗:

\[
C^* = \arg\max_{C_1 \subseteq C} \sum_{G \sqsubseteq \mathcal{G}|C_1} \big[ J_G(Q) \times P(G \mid C_1) \big] \quad \text{such that } |C_1| = k. \tag{11}
\]

In the above statement, J_G(Q) is an indicator function over a possible deterministic graph G ⊑ G|C1, taking value 1 if the nodes in Q are all connected in G, and 0 otherwise. For simplicity, in directed graphs we consider a weak notion of connectivity, i.e., connectivity disregarding edge directions. The extension to strong connectivity is straightforward.
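The sum in Equation (11) ranges over exponentially many possible graphs, so in practice the objective would be estimated by sampling. The Python sketch below shows one way to do this for a fixed catalyst set under the weak notion of connectivity; the function names, the edge-list input, and the use of a union-find structure are assumptions for illustration, and the edge probabilities are taken to be already conditioned on C1.

```python
import random

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def mc_connectivity(nodes, edges, query, num_samples=1000, seed=0):
    """Estimate the probability that all query nodes are weakly connected.

    edges is a list of (u, v, probability); direction is ignored.
    """
    rng = random.Random(seed)
    query = list(query)
    hits = 0
    for _ in range(num_samples):
        parent = {v: v for v in nodes}
        for u, v, p in edges:
            if rng.random() < p:               # edge exists in this sample
                ru, rv = find(parent, u), find(parent, v)
                if ru != rv:
                    parent[ru] = rv
        root = find(parent, query[0])
        # Indicator J_G(Q): 1 if every query node shares one component.
        hits += all(find(parent, q) == root for q in query[1:])
    return hits / num_samples

nodes = ["a", "b", "c"]
always = [("a", "b", 1.0), ("b", "c", 1.0)]
never = [("a", "b", 1.0), ("b", "c", 0.0)]
```

The fraction of samples in which the indicator equals 1 is the estimate of the objective for that catalyst set.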

dataset    nodes        edges        catalysts   edge probabilities: mean, SD, quartiles
DBLP       1 291 297    3 561 816    347         0.21, 0.08, {0.181, 0.181, 0.181}
BioMine    1 045 414    6 742 943    20          0.27, 0.17, {0.116, 0.216, 0.363}
Freebase   28 483 132   46 708 421   5 428       0.50, 0.24, {0.250, 0.500, 0.750}

TABLE 1: Characteristics of the uncertain graphs used in the experiments.

Algorithm. The CONNECTIVITY TOP-k CATALYSTS problem is a generalization of the basic s-t TOP-k CATALYSTS problem (Problem 1). Thus, it can immediately be recognized as NP-hard.

To provide high-quality results, we design a two-step algorithm whose outline is similar in spirit to the Rel-Path algorithm proposed for s-t TOP-k CATALYSTS. As a main difference, however, since our goal in CONNECTIVITY TOP-k CATALYSTS is to maximize connectivity among a set of peer nodes, we ask for the top-r minimum Steiner trees as a first step of the algorithm (rather than the top-r most reliable paths between a single source-target pair). A Steiner tree for a set Q of nodes in a weighted graph is a tree that spans all nodes of Q. A minimum Steiner tree is a Steiner tree whose sum of edge weights is the minimum. We first apply the technique proposed in [15] to find the top-r minimum Steiner trees from an equivalent edge-weighted multi-graph G′′. We recall that G′′ can be obtained from the input uncertain graph G by following Algorithm 2. Next, we iteratively include the Steiner trees in our solution so as to maximize the marginal gain in the probability that the nodes in Q are connected, while not exceeding the budget on catalysts.

Time complexity. The complexity to find the top-r minimum Steiner trees is $O(3^{|Q|} n + 2^{|Q|}((|Q| + \log n) n + e))$ [15]. As for Iterative Path Inclusion, the complexity of our iterative tree inclusion method is $O(r^2 (n' + m') K)$, where, we recall, K is the number of samples used in each MC sampling, and n′ and m′ are the number of nodes and edges, respectively, in the subgraph induced by the top-r minimum Steiner trees.

6 EXPERIMENTAL EVALUATION

We report empirical results to show the accuracy, efficiency, and memory usage of the proposed methods. We also provide results on information diffusion to demonstrate the applicability of the top-k catalysts identified by our methods. We report a sensitivity analysis by varying all main parameters: the number of catalysts, reliable paths, and query nodes, and the distance between source and target nodes. The code is implemented in C++ and experiments are performed on a single core of a 100 GB, 2.26 GHz Xeon server.

6.1 Experimental setup

Datasets. We use three real-world uncertain graphs.

DBLP (http://dblp.uni-trier.de/xml). We use this well-known collaboration network, downloaded on August 31, 2016. Each node represents an author, and an edge denotes co-authorship. Each edge is defined by a set of keywords that are present within the titles of the papers co-authored by the respective authors. We selected 347 distinct keywords from all paper titles, e.g., databases, distributed, learning, crowd, verification, etc., based on frequency and how well they represent various sub-areas of computer science. We count the occurrences of a specific keyword in the titles of the papers co-authored by any two authors. Edge probabilities are derived by applying an exponential cdf of mean µ = 5 to this count [23]; hence, if a keyword c appeared t times in the titles of the papers co-authored by the authors u and v, the corresponding probability is p((u, v)|c) = 1 − exp(−t/5). The intuition is that the more often u and v co-authored on keyword c, the higher the chance (i.e., the probability) that u influences v (and vice versa) for that keyword. Therefore, keywords correspond to catalysts for information cascade.
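As a quick check of the formula above, the following Python snippet maps a keyword co-occurrence count to an edge probability. The function name is hypothetical; the formula and the mean of 5 are those stated in the text.

```python
import math

def keyword_edge_probability(count, mean=5.0):
    """Exponential cdf with the given mean, evaluated at the keyword count."""
    return 1.0 - math.exp(-count / mean)

# Probabilities for keywords seen 1, 5, and 20 times in co-authored titles.
p_one = keyword_edge_probability(1)
p_five = keyword_edge_probability(5)
p_twenty = keyword_edge_probability(20)
```

A single co-occurrence (t = 1) gives a probability of about 0.181, which agrees with the DBLP quartiles {0.181, 0.181, 0.181} reported in Table 1.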

BioMine (https://www.cs.helsinki.fi/group/biomine). This is the database of the BIOMINE project [17]. The graph is constructed by integrating cross-references from several biological databases. Nodes represent biological concepts such as genes, proteins, etc., and edges denote real-world phenomena between two nodes, e.g., a gene “codes” for a protein. In our setting these phenomena correspond to catalysts. Edge probabilities, which quantify the existence of a phenomenon between the two endpoints of that edge, were determined in [17] as a combination of three criteria: relevance (i.e., relative importance of that relationship type), informativeness (e.g., degrees of the nodes adjacent to that edge), and confidence in the existence of a specific relationship (e.g., conformity with the biological STRING database).

Freebase (http://www.freebase.com). This is a knowledge graph, where nodes are named entities (e.g., Google) or abstract concepts (e.g., Asian people), while edges represent relationships among those entities (e.g., Jerry Yang is the “founder” of Yahoo!). Relationships correspond to catalysts. We use the probabilistic version of the graph [10].

Query selection. For each set of experiments, we select 500 different queries. If we do not impose any distance constraint between the source and the target, both of them are picked uniformly at random. When we would like to maintain a maximum pairwise distance d from the source to the target, we first select the source uniformly at random. Then, out of all nodes that are within d hops from it, one node is selected uniformly at random as the target. All reported results are averaged over 500 such queries.

Competing methods. We compare the proposed Rel-Path method (Algorithm 1) to the two baselines, Individual top-k and Greedy, discussed in Sections 3.1–3.2. For the sake of brevity, in the remainder of this section we refer to the proposed method as Rel-Path, and to the Individual top-k baseline as Ind-k.

Reliability estimation. Our proposed method and the baselines need a subroutine that estimates conditional reliability for given source node(s), target node(s), and number of catalysts. To this end, we employ the well-established Monte Carlo (MC) sampling method. In particular, to improve efficiency, we combine MC sampling with a breadth-first search from the source node (set) [23], meaning that the coin for establishing whether an edge should be included in the current sample is flipped only upon request. This avoids flipping coins for edges in parts of the graph that are not reached by the current breadth-first search, thus increasing the chance of an early termination. In the experiments, we found that MC sampling converges at around K = 1000 samples on our datasets. This is roughly the same number observed in the literature [23], [32] for these datasets. Hence, we set K = 1000 in all sets of experiments.

6.2 Single-source single-target

Experiments over different datasets. In Table 2, we show the conditional reliability and running time of all competitors for top-5 output catalysts. For our Rel-Path, we use the top-20 most reliable paths with Freebase and BioMine and the top-50 most reliable paths over DBLP, as we observe that, for finding the top-5 catalysts, increasing the number of paths beyond 20 (Freebase and BioMine) and 50 (DBLP) does not significantly increase the quality on the respective datasets. Results with a varying number of most reliable paths, and its dependence on the number of top-k catalysts, will be reported shortly.

            conditional reliability        running time (sec)
dataset     Ind-k   Greedy   Rel-Path      Ind-k    Greedy   Rel-Path
Freebase    0.15    0.15     0.17          1.38     43       0.02
BioMine     0.18    0.43     0.59          1220     26217    5.27
DBLP        0.11    0.26     0.28          85.97    36519    1.07

TABLE 2: Reliability and efficiency over different datasets. Single source-target pair, top-5 catalysts.

        conditional reliability        running time (sec)
k       Ind-k   Greedy   Rel-Path      Ind-k    Greedy    Rel-Path
5       0.18    0.43     0.59          1220     26217     5.27
8       0.18    0.49     0.59          2210     67158     7.05
10      0.18    0.50     0.60          2290     131674    7.37
12      0.23    0.53     0.62          2305     161265    7.98
15      0.34    0.53     0.63          2365     217496    8.30

TABLE 3: Reliability and efficiency with varying number k of output catalysts. Single source-target pair, BioMine dataset.

distance    conditional reliability        running time (sec)
(# hops)    Ind-k   Greedy   Rel-Path      Ind-k    Greedy   Rel-Path
2           0.45    0.75     0.83          346      9798     4.90
4           0.08    0.38     0.64          406      23140    5.37
6           0.02    0.17     0.30          548      29135    5.58

TABLE 4: Reliability and efficiency with varying distance between the source and the target. Single source-target pair, BioMine dataset, top-5 catalysts.

Conditional reliability illustrates the quality of the top-k catalysts found: the higher the reliability, the better the quality. The proposed Rel-Path achieves the best-quality results on all our datasets.

Concerning running time, we observe that Rel-Path is 2-3 orders of magnitude faster than Ind-k, and 3-4 orders of magnitude faster than Greedy. This confirms that performing MC sampling on a significantly reduced version of the input graph leads to significant benefits in terms of efficiency, without affecting accuracy. Surprisingly, Greedy is orders of magnitude slower than Ind-k. The reason is the following. Although only a factor k separates Ind-k from Greedy based on our complexity analysis, what happens in practice is that Ind-k benefits from the early termination of MC sampling much more than Greedy, as Ind-k considers each catalyst individually, while Greedy considers a set of catalysts. One may also note that the running times over BioMine are higher than those over Freebase. Although Freebase has more nodes and edges, the graph is sparse compared to BioMine. Therefore, a breadth-first search in BioMine often traverses more nodes, thus increasing its processing time.

Varying number of catalysts. We show results with a varying number k of output catalysts in Table 3 and Figure 6. Similar trends have been observed on all datasets; thus, for brevity, we report results on BioMine (Table 3, k varies from 5 to 15) and on DBLP (Figure 6, k varies from 10 to 100). As expected, conditional reliability and running time increase with more catalysts. Moreover, as shown in Table 3, our Rel-Path remains more accurate and faster than both baselines for all k.

Varying distance from the source to the target. Table 4 reports results with varying distance between the source and the target. Keeping the number of output catalysts fixed, as expected, the reliability achieved by all three methods decreases with larger distance from the source to the target. However, we observe that the reliability drops sharply for Ind-k. This is because, with increasing distance, it becomes less likely that there is a path from the source to the target due to only one catalyst. We also note that the reliability decreases more in Greedy than in the proposed Rel-Path. This is due to the cold-start problem of Greedy: it is more likely for Greedy to make mistaken choices in the initial steps if the source and the target are connected by longer paths.

        BioMine                              Freebase
r       cond. reliability   time (sec)      cond. reliability   time (sec)
1       0.27                4.29            0.12                0.0004
2       0.29                4.26            0.12                0.002
3       0.31                4.30            0.13                0.002
4       0.31                4.30            0.14                0.003
5       0.31                4.31            0.14                0.004
10      0.32                4.31            0.16                0.008
15      0.32                4.37            0.17                0.013
20      0.33                5.26            0.17                0.018
30      0.33                5.29            0.17                0.020
50      0.33                5.38            0.17                0.035
100     0.33                5.70            0.17                0.081

TABLE 5: Reliability and efficiency with varying number r of most reliable paths in the proposed Rel-Path. Single source-target pair, top-5 catalysts.

Fig. 6: Reliability and efficiency with varying number r of most reliable paths in the proposed Rel-Path. Single source-target pair, number of top-k catalysts varies from k=10 to k=100, DBLP. Panels: (a) conditional reliability, (b) running time (sec), both plotted against the number of top-r reliable paths.

Fig. 7: Sufficient number of top-r reliable paths to find the top-k catalysts by Rel-Path, number of top-k catalysts varies from k=10 to k=100, DBLP.

Datasets (nodes, edges)      Memory Usage
DBLP (1.3M, 3.6M)            1.9 GB
BioMine (1.0M, 6.7M)         1.8 GB
Freebase (28.5M, 46.7M)      16.0 GB

TABLE 6: Memory usage for Rel-Path

Fig. 8: Varying #rel. paths (memory usage in GB, BioMine).

Varying number of most reliable paths. We also test Rel-Path for different values of the number r of most reliable paths discovered in the first step of the algorithm. We report these results for BioMine and Freebase in Table 5, and for DBLP in Figures 6 and 7. For the BioMine and Freebase datasets, we fix the number k of output catalysts to 5, and we observe that, while increasing the number of paths, the reliability initially increases, then saturates at a certain value of r (e.g., r = 20 for BioMine and r = 15 for Freebase). This behavior is expected, since the subsequent paths have very small reliability. Hence, including them does not significantly increase the quality of the solution found so far. On the other hand, the running time increases almost linearly when more top-r paths are considered.

A similar behavior is observed on the DBLP dataset. Here we additionally vary the number k of output catalysts from 10 to 100 (Figure 6), and we find that, as k increases, a larger set of reliable paths needs to be considered for accuracy to stabilize. For instance, for k = 10, about r = 20 reliable paths suffice, beyond which no tangible accuracy improvement is observed. On the other hand, for k = 100, r = 60 paths are required (Figure 7). Once again, this behavior is expected: the larger the number k of catalysts to be output, the larger the subgraph connecting source to target to be explored, and, hence, the larger the number of paths to be considered so as to satisfactorily cover that subgraph.

Fig. 9: Reliability and efficiency for multiple source-target pairs: Freebase, top-5 catalysts, aggregate function = minimum. Panels: (a) conditional reliability, (b) running time (sec), both plotted against #Source:#Target for Ind-k, Greedy, and Rel-Path.

Fig. 10: Reliability and efficiency for multiple source-target pairs: Freebase, top-5 catalysts, aggregate function = maximum. Panels: (a) reliability, (b) running time (sec), both plotted against #Source:#Target for Ind-k, Greedy, and Rel-Path.

Fig. 11: Reliability and efficiency for multiple source-target pairs: DBLP, top-10 catalysts, aggregate function = average. Panels: (a) reliability, (b) running time (sec), both plotted against #Source:#Target for Ind-k, Greedy, and Rel-Path.

Fig. 12: Scalability with many sources-targets for Rel-Path: top-20 catalysts. Panels: (a) BioMine, Avg. aggregate, (b) Freebase, Min. aggregate, running time (sec) plotted against #Source:#Target.

Memory usage. We report the memory usage of Rel-Path in Table 6. This is dominated by the space required for the graph in main memory. The top-r reliable paths selected by our algorithm consume only a few tens of megabytes. Moreover, the memory consumption increases linearly with the number of reliable paths selected (Figure 8).

6.3 Multiple-sources multiple-targets

Aggregate functions. We perform experiments to evaluate the reliability and efficiency of our methods that maximize an aggregate function over conditional reliabilities for many source-target pairs. We consider the Minimum aggregate function, and vary the number of source and target nodes from 2 to 5. In these experiments, we fix the maximum distance between any source-target pair to 4. We also ensure that the same node is not included both in the source and target sets.

We show the performance of our algorithms over Freebase (Figure 9). Similar to queries with single source-target pairs, Rel-Path outperforms Ind-k and Greedy both in terms of efficiency and conditional reliability. In particular, due to the presence of multiple source-target pairs, running time differences scale up, and Rel-Path is at least four orders of magnitude faster than the baselines.

            Connectivity                   Running Time (sec)
Datasets    Ind-k   Greedy   Rel-Path      Ind-k    Greedy     Rel-Path
Freebase    0.01    0.10     0.10          1 908    13 175     80
BioMine     0.29    0.47     0.71          893      37 992     310
DBLP        0.30    0.33     0.35          306      116 340    85

TABLE 7: Connectivity query with 4 nodes, top-5 catalysts

We find that with more source-target pairs, the minimum reliability achieved decreases (Figure 9(a)). This can be explained as follows. As we keep the number of top-k catalysts fixed at k = 5, with more source and target nodes, the likelihood of getting one source-target pair with small reliability attained by those top-k catalysts increases.

Different aggregate functions and datasets. We demonstrate how our aggregate functions perform over Freebase and DBLP, in Figures 10 and 11, respectively. Due to common trends, we only show Maximum over Freebase and Average over DBLP. We find that Rel-Path results in better reliability compared to Greedy over all experiments. Their difference shrinks in both datasets with more source-target pairs, which is due to the fact that we keep the number of top-k catalysts fixed at k = 5 (for Freebase) and at k = 10 (for DBLP). As before, Rel-Path is at least four to five orders of magnitude faster than Greedy in all scenarios. In particular, Greedy requires about $10^5$ seconds to answer a single query, which makes it almost infeasible to apply this baseline technique in any real-world online application.

It is interesting that with more source and target pairs, the maximum reliability increases (Figure 10), but the average reliability decreases (Figure 11). This is expected since with more source-target pairs, the chance of getting one pair with higher reliability also increases, thereby improving the maximum reliability. On the contrary, as we consider more source-target pairs while keeping the total number of catalysts the same, the average reliability naturally decreases.
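The effect of the aggregate function can be illustrated with a minimal Python sketch. It assumes that the per-pair conditional reliabilities for a fixed catalyst set are already available; the function name `aggregate_reliability`, the pair labels, and the reliability values are hypothetical and not taken from our experiments.

```python
from statistics import mean

def aggregate_reliability(pair_reliability, how):
    """Aggregate per-pair conditional reliabilities with min, max, or average."""
    values = list(pair_reliability.values())
    if not values:
        raise ValueError("need at least one source-target pair")
    if how == "min":
        return min(values)
    if how == "max":
        return max(values)
    if how == "avg":
        return mean(values)
    raise ValueError(f"unknown aggregate: {how}")

# Hypothetical reliabilities R((s, t) | C) for one fixed catalyst set C.
few_pairs = {("s1", "t1"): 0.40, ("s1", "t2"): 0.25}
# Adding sources and targets adds pairs; the existing values are unchanged.
more_pairs = dict(few_pairs)
more_pairs.update({("s2", "t1"): 0.55, ("s2", "t2"): 0.10})

for how in ("min", "max", "avg"):
    print(how,
          aggregate_reliability(few_pairs, how),
          aggregate_reliability(more_pairs, how))
```

For a fixed catalyst set, adding pairs can only lower the minimum and raise the maximum, which matches the direction of the trends discussed above. The average can move either way in such a toy setting, so the decrease we report for it should be read as an empirical observation under a fixed catalyst budget.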

Scalability with many sources and targets. We demonstrate scalability of our Rel-Path algorithm with multiple source and target nodes (up to 100 × 100 = 10K source-target pairs and top-20 catalysts) in Figure 12. We observe that the running time of Rel-Path increases almost linearly with the number of source-target pairs. Note that we do not report running times of the Ind-k and Greedy baselines, as they do not scale beyond a small number of sources and targets, as shown earlier in Figures 9, 10, and 11.

Connectivity maximization. We illustrate the performance of our algorithms that maximize connectivity (defined in Section 5.2) across multiple query nodes. For these experiments, we select 4 query nodes with maximum pairwise distance between any two nodes fixed at 2. We compare the connectivity attained by top-5 catalysts in Table 7. It can be observed that Greedy and Rel-Path perform equally well in Freebase, whereas Rel-Path results in higher connectivity over BioMine and DBLP. We further analyze the top-20 Steiner trees retrieved in BioMine, and find that each of these Steiner trees requires 3 to 5 distinct catalysts. Therefore, in this dataset, Greedy makes more mistakes at initial stages. Because of the complexity of the top-20 Steiner tree finding algorithm, Rel-Path requires more running time in these experiments. However, Rel-Path is still significantly faster than the other two baselines over all our datasets.

[Figure 13, panel (a): Information Spread. x-axis: #Target Nodes; y-axis: Inf. Spread; series: Rel-Path, Random, All. Panel (b): Running Time. x-axis: #Target Nodes; y-axis: Running Time (Sec); series: Rel-Path.]

Fig. 13: (a) Expected information spread by the top-10 catalysts and (b) running time to find the top-10 catalysts: DBLP, DB source nodes, DB target nodes

[Figure 14, panel (a): Information Spread. x-axis: #Target Nodes; y-axis: Inf. Spread; series: Rel-Path, Random, All. Panel (b): Running Time. x-axis: #Target Nodes; y-axis: Running Time (Sec); series: Rel-Path.]

Fig. 14: (a) Expected information spread by the top-10 catalysts and (b) running time to find the top-10 catalysts: DBLP, ARCH source nodes, DB target nodes

6.4 Application in information cascade

Here we showcase our top-k catalysts problem in the context of information diffusion over social networks. We present our results over the DBLP dataset.

We select top-k catalysts (i.e., keywords) according to the Average aggregate function for multiple sources and targets (see Section 5). As discussed earlier, the more u and v co-authored on keyword c, the higher the chance (i.e., the probability) that u influences v (and vice versa) for that keyword. Therefore, keywords correspond to catalysts for information cascade. The ultimate goal of this application is to show that the catalysts selected by our method effectively accomplish the task of maximizing the expected spread of information between the source nodes and the target nodes. To this purpose, we measure the expected spread achieved by the top-k catalysts selected by our method, and compare it to the expected spread achieved by (i) k random catalysts, and (ii) all catalysts.

Source nodes from Databases. We find the top-10 authors having the maximum number of publications in top-tier database conferences and journals. They are: {Divesh Srivastava, Surajit Chaudhuri, Jiawei Han, Philip S. Yu, Hector Garcia-Molina, Jeffrey F. Naughton, H. V. Jagadish, Michael Stonebraker, Beng Chin Ooi, Raghu Ramakrishnan}.

Source nodes from Computer Architecture. In an analogous manner, we select the top-10 authors from the computer architecture domain: {Alberto L. Sangiovanni-Vincentelli, Jingsheng Jason Cong, Massoud Pedram, Andrew B. Kahng, Robert K. Brayton, Yao-Wen Chang, David Blaauw, Miodrag Potkonjak, Kaushik Roy, Xianlong Hong}.

Target nodes from Databases. We consider authors having at least 5 publications in top-tier database conferences and journals as our target nodes. We vary the number of target nodes from 400 to 1000, selected uniformly at random from them, to demonstrate the scalability of our algorithm.

In Figures 13(a) and 14(a), we show the expected information spread achieved by the top-10 catalysts selected via our Rel-Path method, under two scenarios: Case-1, where both source nodes and target nodes are from databases (DB), and Case-2, where source nodes are from architecture (ARCH) and target nodes are from databases (DB). To demonstrate the quality of our results, we also report the expected information spread achieved by selecting 10 catalysts uniformly at random (denoted as “Random” in the figures). We observe from Figures 13(a) and 14(a) that Rel-Path selects high-quality catalysts, and significantly outperforms the Random method.

In particular, the catalysts selected by Rel-Path under Case-1 are all DB-related, e.g., database systems, relational, information extraction, keyword search, XML, data mining, etc. On the other hand, the catalysts selected by Rel-Path under Case-2 belong to DB or ARCH areas, e.g., CMOS, FPGA, storage system, cache, VLSI circuit, On-chip, transactional memory, data stream, etc. Both in Figures 13(a) and 14(a), we also report the total information spread achieved by all 5 428 catalysts (i.e., keywords) present in the DBLP dataset. This is denoted as “All” in the figures. We find that the information spread achieved by only the top-10 catalysts is generally within 70-90% of the total information spread achieved by all 5 428 catalysts. These results demonstrate the relevance of our novel problem and its solution in the domain of information cascade over social influence networks.
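One way to estimate an expected spread for a given catalyst set is Monte Carlo sampling of possible worlds, sketched below in Python. This is a minimal illustration and not our experimental code. It assumes (i) that catalysts act independently, so an edge exists unless every selected catalyst fails to activate it, and (ii) that spread is measured as the expected number of target nodes reachable from at least one source node. The function names, the toy graph, and its probabilities are all hypothetical.

```python
import random
from collections import deque

def edge_prob(cond_probs, catalysts):
    """Assumed rule: the edge exists unless every chosen catalyst fails."""
    miss = 1.0
    for c in catalysts:
        miss *= 1.0 - cond_probs.get(c, 0.0)
    return 1.0 - miss

def expected_spread(edges, sources, targets, catalysts, samples=2000, seed=0):
    """Monte Carlo estimate of the expected number of reachable targets.

    edges maps a directed pair (u, v) to {catalyst: p(e | catalyst)}.
    """
    rng = random.Random(seed)
    target_set = set(targets)
    probs = {e: edge_prob(cp, catalysts) for e, cp in edges.items()}
    total = 0
    for _ in range(samples):
        # Sample one possible world: keep each edge independently.
        adj = {}
        for (u, v), p in probs.items():
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
        # Breadth-first search from all sources at once.
        seen = set(sources)
        queue = deque(sources)
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        total += len(seen & target_set)
    return total / samples

# Hypothetical toy graph with keyword catalysts.
edges = {
    ("a", "b"): {"xml": 0.8, "vlsi": 0.1},
    ("b", "c"): {"xml": 0.5},
    ("a", "d"): {"vlsi": 0.9},
}
print(expected_spread(edges, ["a"], ["c", "d"], {"xml"}))
print(expected_spread(edges, ["a"], ["c", "d"], {"xml", "vlsi"}))
```

Comparing the estimate for a selected catalyst set against the estimate for all catalysts gives the kind of ratio reported above. The number of samples and the seed are arbitrary choices here.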

Furthermore, we find that the running time to find the top-k catalysts via Rel-Path increases almost linearly with more target nodes (see Figures 13(b) and 14(b)), which illustrates the scalability of our technique.

7 RELATED WORK

To the best of our knowledge, the problem of finding the top-k catalysts for maximizing the conditional reliability, which we study in this work, is novel. In the following, we provide an overview of relevant work in neighboring areas.

Reliability queries in uncertain graphs. Reliability is a classic problem studied in systems and device networks [3]. Reliability has been recently studied in the context of large social and biological networks. Due to its #P-completeness [5], efficient sampling, pruning, and indexing methods have been considered [23], [24], [26], [32], [37].

Constrained reachability queries. Mendelzon and Wood show that finding all simple paths in a (deterministic) graph matching a regular expression is NP-hard [30]. There are some query languages which support regular expression queries only in some restricted form, e.g., GraphQL, SoQL, GLEEN, XPATH, and SPARQL. Fan et al. [19] study a special case of regular expressions that can be solved in quadratic time. Edge-label constrained reachability and distance queries have been studied in [8], [22].

Label-constrained reachability queries have also been considered in the context of uncertain graphs [10]. However, in that work the goal was to estimate the reliability between two nodes under the constraint that paths connecting the two nodes contain only some admissible labels. Thus, the input graph still has fixed edge probabilities that do not vary based on external conditions. As a result, label-constrained reachability differs from conditional reliability introduced in this work, and, more importantly, our problem of finding the top-k external conditions is not addressed in that work.

Explaining relationships among entities. Several works aim at identifying the best subgraphs/paths to describe how some input entities are related [18], [20], [34]. Sun et al. propose PathSim [35] to find entities that are connected by similar relationship patterns. However, all these works consider deterministic graphs. The semantics behind the notion of connectivity in uncertain graphs is different.

Uncertain graphs with correlated edge probabilities. Although the bulk of the literature on uncertain graphs assumes edges to be independent of one another [7], [10], [23], [26], some works deal with correlated edge probabilities, where the existence of an edge may depend on the existence of other edges in the graph (typically, edges sharing an end node) [13], [14], [32], [36]. Another model that differs from the classic one is the one adopted by Liu et al. [28], which considers that every edge is assigned a (discrete) probability density function over a set of possible edge weights. However, none of those works model edge-existence probabilities conditioned on external factors, nor do they study the problem of finding the top-k factors that maximize the reliability between two (sets of) nodes.

Topic-aware influence maximization. The classical problem of influence maximization has been recently considered in a topic-aware fashion [6], [11]. Although the input to that problem is similar to the input considered in this work (an uncertain graph where edge probabilities depend on some conditions), topic-aware influence maximization solves a different problem, i.e., finding a set of seed nodes that maximize the spread of information for a given topic set. Topic-aware influence maximization can however benefit from the solutions provided by our top-k catalysts problem, e.g., in the case where topics are not known in advance. A recent work by Li et al. [27] focuses on the problem of finding a size-k tag set that maximizes the expected spread of influence started from a given source node. Our work is different as we aim at finding the top-k external factors maximizing the reliability between two given (sets of) nodes.

Difference with our prior work. A preliminary version of this work was published as a short paper in [25]. The present version contains substantial new material: a complete piece of research work concerning the generalization to the case of multiple source/target nodes, including problem formulations, theory, applications, algorithms, and experiments; important theoretical findings; more details, examples, and motivations for all the proposed algorithms, including detailed time-complexity analyses; many additional experiments, including applications in information cascade; a detailed overview of the related literature.

8 CONCLUSIONS

We formulated and investigated a novel problem of identifying the top-k catalysts that maximize the reliability between source and target nodes in an uncertain graph. We proposed a method based on iterative reliable-path inclusion. Our experiments show that the proposed method achieves better quality and significantly higher efficiency compared to simpler baselines. In the future, we shall consider more complex relationships between an edge and the catalysts, and other problems from the perspective of top-k catalysts, e.g., nearest neighbors and influence maximization.

REFERENCES

[1] E. Adar and C. Re. Managing Uncertainty in Social Networks. IEEE Data Eng. Bull., 2007.
[2] C. Aggarwal. Managing and Mining Uncertain Data. Springer, 2009.
[3] K. K. Aggarwal, K. B. Misra, and J. S. Gupta. Reliability Evaluation A Comparative Study of Different Techniques. Micro. Rel., 14(1), 1975.
[4] S. Aral and D. Walker. Creating Social Contagion Through Viral Product Design: A Randomized Trial of Peer Influence in Networks. Management Science, 57(9):1623–1639, 2011.
[5] M. O. Ball. Computational Complexity of Network Reliability Analysis: An Overview. IEEE Tran. on Reliability, 1986.
[6] N. Barbieri, F. Bonchi, and G. Manco. Topic-Aware Social Influence Propagation Models. In ICDM, 2012.
[7] P. Boldi, F. Bonchi, A. Gionis, and T. Tassa. Injecting Uncertainty in Graphs for Identity Obfuscation. PVLDB, 5(11):1376–1387, 2012.
[8] F. Bonchi, A. Gionis, F. Gullo, and A. Ukkonen. Distance Oracles in Edge-Labeled Graphs. In EDBT, 2014.
[9] M. Chaplin and C. Bucke. Enzyme Technology. Cambridge University Press, 1990.
[10] M. Chen, Y. Gu, Y. Bao, and G. Yu. Label and Distance-Constraint Reachability Queries in Uncertain Graphs. In DASFAA, 2014.
[11] S. Chen, J. Fan, G. Li, J. Feng, K. L. Tan, and J. Tang. Online Topic-aware Influence Maximization. PVLDB, 8(6):666–677, 2015.
[12] W. Chen, C. Wang, and Y. Wang. Scalable Influence Maximization for Prevalent Viral Marketing in Large-Scale Social Networks. In KDD, 2010.
[13] Y. Cheng, Y. Yuan, G. Wang, B. Qiao, and Z. Wang. Efficient Sampling Methods for Shortest Path Query over Uncertain Graphs. In DASFAA, 2014.
[14] Y.-R. Cheng, Y. Yuan, L. Chen, and G.-R. Wang. Threshold-Based Shortest Path Query over Large Correlated Uncertain Graphs. JCST, 30(4):762–780, 2015.
[15] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding Top-k Min-Cost Connected Trees in Databases. In ICDE, 2007.
[16] D. Eppstein. Finding the k Shortest Paths. SIAM J. Comput., 28(2):652–673, 1998.
[17] L. Eronen and H. Toivonen. Biomine: Predicting Links between Biological Entities using Network Models of Heterogeneous Databases. BMC Bioinformatics, 13(1), 2012.
[18] C. Faloutsos, K. S. McCurley, and A. Tomkins. Fast Discovery of Connection Subgraphs. In KDD, 2004.
[19] W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding Regular Expressions to Graph Reachability and Pattern Queries. In ICDE, 2011.
[20] L. Fang, A. D. Sarma, C. Yu, and P. Bohannon. REX: Explaining Relationships Between Entity Pairs. PVLDB, 5(3):241–252, 2011.
[21] R. K. Iyer and J. A. Bilmes. Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints. In NIPS, 2013.
[22] R. Jin, H. Hong, H. Wang, N. Ruan, and Y. Xiang. Computing Label-Constraint Reachability in Graph Databases. In SIGMOD, 2010.
[23] R. Jin, L. Liu, B. Ding, and H. Wang. Distance-Constraint Reachability Computation in Uncertain Graphs. PVLDB, 4(9):551–562, 2011.
[24] A. Khan, F. Bonchi, A. Gionis, and F. Gullo. Fast Reliability Search in Uncertain Graphs. In EDBT, 2014.
[25] A. Khan, F. Gullo, T. Wohler, and F. Bonchi. Top-k Reliable Edge Colors in Uncertain Graphs. In CIKM, 2015.
[26] R. Li, J. X. Yu, R. Mao, and T. Jin. Efficient and Accurate Query Evaluation on Uncertain Graphs via Recursive Stratified Sampling. In ICDE, 2014.
[27] Y. Li, J. Fan, D. Zhang, and K.-L. Tan. Discovering Your Selling Points: Personalized Social Influential Tags Exploration. In SIGMOD, 2017.
[28] Z. Liu, C. Wang, and J. Wang. Aggregate Nearest Neighbor Queries in Uncertain Graphs. In WWW, 2014.
[29] J.-L. Ma, B.-C. Yin, X. Wu, and B.-C. Ye. Simple and Cost-Effective Glucose Detection Based on Carbon Nanodots Supported on Silver Nanoparticles. Analytical Chemistry, 89(2):1323–1328, 2017.
[30] A. O. Mendelzon and P. T. Wood. Finding Regular Simple Paths in Graph Databases. SIAM J. Comput., 24(6):1235–1258, 1995.
[31] E. F. Murphy, S. G. Gilmour, and M. J. C. Crabbe. Efficient and Cost-Effective Experimental Determination of Kinetic Constants and Data: The Success of a Bayesian Systematic Approach to Drug Transport, Receptor Binding, Continuous Culture and Cell Transport Kinetics. FEBS Letters, 556(1):193–198, 2004.
[32] M. Potamias, F. Bonchi, A. Gionis, and G. Kollios. k-Nearest Neighbors in Uncertain Graphs. PVLDB, 3(1-2):997–1008, 2010.
[33] M. Sieff. Why Hillary Clinton Lost Her Blue Wall. http://www.martinsieff.com/cycles-of-change/hillary-clinton-lost-blue-wall/, 2016.
[34] M. Sozio and A. Gionis. The Community-search Problem and How to Plan a Successful Cocktail Party. In KDD, 2010.
[35] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks. PVLDB, 4(11):992–1003, 2011.
[36] Y. Yuan, G. Wang, H. Wang, and L. Chen. Efficient Subgraph Search over Large Uncertain Graphs. PVLDB, 4(11):876–886, 2011.
[37] R. Zhu, Z. Zou, and J. Li. Top-k Reliability Search on Uncertain Graphs. In ICDM, 2015.

APPENDIX

Proof of Theorem 2. A problem is said to admit a Polynomial Time Approximation Scheme (PTAS) if the problem admits a polynomial-time constant-factor approximation algorithm for every constant β ∈ (0, 1). We prove the theorem by showing that there exists at least one value of β such that, if a β-approximation algorithm for s-t TOP-k CATALYSTS exists, then we can solve the well-known SET COVER problem in polynomial time. Since SET COVER is an NP-hard problem, clearly this can happen only if P = NP.

In SET COVER we are given a universe U, and a set of h subsets of U, i.e., S = {S1, S2, . . . , Sh}, where Si ⊆ U, for all i ∈ [1 . . . h]. The decision version of SET COVER asks the following question: given k, is there a solution with no more than k sets that cover the whole universe?
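For concreteness, the decision question can be written as a small exhaustive search in Python. This sketch is only an illustration of the problem statement and plays no role in the reduction; the function name `has_cover` and the toy instance are hypothetical, and the search takes exponential time in the number of subsets, as one would expect for an NP-hard problem.

```python
from itertools import combinations

def has_cover(universe, subsets, k):
    """Decision version of SET COVER by exhaustive search (exponential time)."""
    universe = set(universe)
    # Try every collection of at most k subsets.
    for size in range(0, min(k, len(subsets)) + 1):
        for chosen in combinations(subsets, size):
            if set().union(*chosen) >= universe:
                return True
    return False

# Hypothetical instance with |U| = 5 and h = 4 subsets.
U = {1, 2, 3, 4, 5}
S = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(has_cover(U, S, 1))
print(has_cover(U, S, 2))
```

In this instance no single subset covers U, while {1, 2, 3} together with {4, 5} does, so the answer is negative for k = 1 and positive for k = 2.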

Given an instance of SET COVER, we construct in polynomial time an instance of our s-t TOP-k CATALYSTS problem in the same way as in Theorem 1. On this instance, if k sets suffice to cover the whole universe in the original instance of SET COVER, the optimal solution $C^*$ would have reliability at most $[1 - (1 - p^2)^Z]$, where $Z = |U|$ (because at most Z disjoint paths from s to t would be produced, each with existence probability $p^2$). On the other hand, if no k sets cover the whole universe, $C^*$ would have reliability at most $[1 - (1 - p^2)^{Z-1}]$ (because at least one of the disjoint paths would be discarded).

Now, assume that a polynomial-time $\beta$-approximation algorithm for s-t TOP-k CATALYSTS exists, for some $\beta \in (0, 1)$. Call it “Approx”. Approx would yield a solution $C_2$ such that $R((s, t)|C_2) \geq \beta R((s, t)|C^*)$. Now, consider the inequality $[1 - (1 - p^2)^{Z-1}] < \beta [1 - (1 - p^2)^Z]$. If this inequality has a solution for some values of $\beta$ and p, then by simply running Approx on the instance of s-t TOP-k CATALYSTS constructed this way, and checking the reliability of the solution returned by Approx, one can answer SET COVER in polynomial time: a solution to SET COVER exists iff the solution given by Approx has reliability $\geq \beta [1 - (1 - p^2)^Z]$. Thus, to prove the theorem we need to show that a solution to that inequality exists.

To this end, consider the real-valued function

$$f(p, Z) = \frac{1 - (1 - p^2)^{Z-1}}{1 - (1 - p^2)^{Z}}.$$

Our inequality has a solution iff $\beta > f(p, Z)$. It is easy to see that $f(p, Z) < 1$, for all $Z \geq 1$ and $p > 0$. This means that there will always be a value of $\beta \in (0, 1)$ and p for which $\beta > f(p, Z)$ is satisfied, regardless of Z. Hence, there exists at least one value of $\beta$ such that the inequality $[1 - (1 - p^2)^{Z-1}] < \beta [1 - (1 - p^2)^Z]$ has a solution, and, based on the above argument, such that no $\beta$-approximation algorithm for Problem 1 can exist. The theorem follows.
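The claim that f(p, Z) stays below 1, and hence that a separating value of β exists, can be checked numerically with the following Python sketch. It is a sanity check on a few hypothetical (p, Z) pairs, not a substitute for the argument; the choice of β as the midpoint between f(p, Z) and 1 is just one convenient option.

```python
def f(p, Z):
    """Ratio of the no-cover reliability bound to the cover reliability bound."""
    q = 1.0 - p * p          # probability that one two-edge path is missing
    return (1.0 - q ** (Z - 1)) / (1.0 - q ** Z)

# Hypothetical sample points with p > 0 and Z >= 1.
samples = [(0.5, 1), (0.5, 5), (0.9, 5), (0.1, 50)]

for p, Z in samples:
    ratio = f(p, Z)
    beta = (1.0 + ratio) / 2.0   # any beta strictly between f(p, Z) and 1 works
    no_cover = 1.0 - (1.0 - p * p) ** (Z - 1)
    with_cover = 1.0 - (1.0 - p * p) ** Z
    # The two cases of the reduction are separated by the threshold beta * with_cover.
    assert no_cover < beta * with_cover
    print(f"p={p}, Z={Z}: f={ratio:.6f}, separating beta={beta:.6f}")
```

When f(p, Z) is close to 1, the separating β must also be close to 1, which is what one expects of an argument that rules out approximation schemes rather than every constant factor.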

Proof of Theorem 4. If both our objective function and the constraint were proved to be submodular, our iterative path inclusion problem (Problem 2) would become an instance of the Sub-modular Cost Sub-modular Knapsack (SCSK) problem [21], and the approximation result in Theorem 4 would easily follow from [21]. In the following we show that indeed both our constraint (Lemma 1) and our objective function (Lemma 2) are submodular, thus also proving Theorem 4.

Lemma 1. The constraint of the iterative path inclusion problem, i.e., the total number of catalysts on edges of the included paths, is sub-modular with respect to inclusion of paths.

Proof. Consider two path sets $\mathcal{P}_1$, $\mathcal{P}_2$ from s to t such that $\mathcal{P}_2 \supseteq \mathcal{P}_1$. Also, we assume a path P from s to t, where $P \notin \mathcal{P}_2$. There can be two distinct cases: (a) P has no common catalyst with the paths in $\mathcal{P}_2 \setminus \mathcal{P}_1$. (b) P has at least one common catalyst with the paths in $\mathcal{P}_2 \setminus \mathcal{P}_1$. In the first case,

$$\Big| \bigcup_{e \in \mathcal{P}_1 \cup \{P\}} C(e) \Big| - \Big| \bigcup_{e \in \mathcal{P}_1} C(e) \Big| = \Big| \bigcup_{e \in \mathcal{P}_2 \cup \{P\}} C(e) \Big| - \Big| \bigcup_{e \in \mathcal{P}_2} C(e) \Big| \tag{12}$$

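In the second case,

$$\Big| \bigcup_{e \in \mathcal{P}_1 \cup \{P\}} C(e) \Big| - \Big| \bigcup_{e \in \mathcal{P}_1} C(e) \Big| \geq \Big| \bigcup_{e \in \mathcal{P}_2 \cup \{P\}} C(e) \Big| - \Big| \bigcup_{e \in \mathcal{P}_2} C(e) \Big| \tag{13}$$

because the catalysts that P shares with the paths in $\mathcal{P}_2 \setminus \mathcal{P}_1$ are already counted for $\mathcal{P}_2$, so they cannot add to the marginal count on the right-hand side.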

Hence, the result follows.
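The diminishing marginal cost can be seen on a toy instance with the following Python sketch. The edge-to-catalyst assignment, the path names, and the helper functions are hypothetical and serve only to illustrate the counting argument for case (b).

```python
def catalysts_used(paths, edge_catalysts):
    """Set of catalysts appearing on any edge of any path in the collection."""
    used = set()
    for path in paths:
        for edge in path:
            used |= edge_catalysts[edge]
    return used

def marginal_cost(paths, new_path, edge_catalysts):
    """Number of extra catalysts needed when new_path joins the collection."""
    before = catalysts_used(paths, edge_catalysts)
    after = catalysts_used(list(paths) + [new_path], edge_catalysts)
    return len(after) - len(before)

# Hypothetical edge-to-catalyst assignment C(e).
edge_catalysts = {
    ("s", "a"): {"c1"}, ("a", "t"): {"c2"},
    ("s", "b"): {"c3"}, ("b", "t"): {"c4"},
    ("s", "d"): {"c3"}, ("d", "t"): {"c5"},
}
path_a = [("s", "a"), ("a", "t")]
path_b = [("s", "b"), ("b", "t")]
path_d = [("s", "d"), ("d", "t")]   # shares catalyst c3 with path_b

small = [path_a]             # plays the role of the smaller path set
large = [path_a, path_b]     # a superset of the smaller path set

print(marginal_cost(small, path_d, edge_catalysts))  # 2 (c3 and c5 are new)
print(marginal_cost(large, path_d, edge_catalysts))  # 1 (only c5 is new)
```

In this instance the marginal cost of the new path with respect to the larger set is no greater than with respect to the smaller set, since the shared catalyst c3 is already paid for by path_b.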

Lemma 2. If the top-r most reliable paths are node-disjoint (except at source and target nodes), then the objective function of the iterative path selection problem (Problem 2), i.e., $\mathrm{Rel}_{\mathcal{P}_1}(s, t)$, is sub-modular with respect to inclusion of paths.

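Proof. Assume $\mathcal{P}_1, \mathcal{P}_2 \subset \mathcal{P}$, such that $\mathcal{P}_1 \subseteq \mathcal{P}_2$. Also consider a path $P \in \mathcal{P}$ with $P \notin \mathcal{P}_2$. Let us denote $\mathrm{Rel}_{\mathcal{P}_1}(s, t) = p_1$, $\mathrm{Rel}_{\mathcal{P}_1 \cup \{P\}}(s, t) = p_1 + \delta$, and $\mathrm{Rel}_{\mathcal{P}_2 \setminus \mathcal{P}_1}(s, t) = p_2$. Due to our assumption that the top-r most reliable paths in $\mathcal{P}$ are node-disjoint except at the source and the target, we have: $\mathrm{Rel}_{\mathcal{P}_2}(s, t) = 1 - (1 - p_1)(1 - p_2)$, and $\mathrm{Rel}_{\mathcal{P}_2 \cup \{P\}}(s, t) = 1 - (1 - p_1 - \delta)(1 - p_2)$. Hence, $\mathrm{Rel}_{\mathcal{P}_2 \cup \{P\}}(s, t) - \mathrm{Rel}_{\mathcal{P}_2}(s, t) = (1 - p_2)\delta$. This is smaller than or equal to $\delta$, which was the marginal gain for including the path P in the set $\mathcal{P}_1$. Therefore, our objective function is sub-modular.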
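The identity for the marginal gain can be verified numerically with a short Python sketch. It assumes, as in the lemma, that the paths share no edge or intermediate node, so that the set fails only if every path fails independently. The path probabilities below are hypothetical.

```python
from math import prod

def rel_disjoint(path_probs):
    """Reliability of a set of s-t paths that share no edge or intermediate node."""
    return 1.0 - prod(1.0 - p for p in path_probs)

# Hypothetical existence probabilities of individual node-disjoint paths.
set_1 = [0.30, 0.20]          # paths in the smaller set
extra = [0.50]                # paths in the larger set but not in the smaller one
new_path = 0.40               # the path P being added

gain_small = rel_disjoint(set_1 + [new_path]) - rel_disjoint(set_1)
gain_large = (rel_disjoint(set_1 + extra + [new_path])
              - rel_disjoint(set_1 + extra))
p2 = rel_disjoint(extra)

# gain_small corresponds to delta, and gain_large to (1 - p2) * delta.
print(gain_small, gain_large, (1.0 - p2) * gain_small)
```

The gain with respect to the larger set equals the gain with respect to the smaller set scaled by (1 − p2), up to floating-point rounding, and is therefore never larger.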

Arijit Khan is an Assistant Professor at Nanyang Technological University, Singapore. He earned his PhD from the University of California, Santa Barbara, and did a post-doc in the Systems group at ETH Zurich. Arijit is the recipient of the IBM PhD Fellowship in 2012-13. He co-presented tutorials on graph queries and systems at ICDE 2012, VLDB 2014, 2015, 2017.

Francesco Bonchi is a Research Leader at the ISI Foundation, Turin, Italy. Previously, he was Director of Research at Yahoo Labs in Barcelona, Spain. He is an Associate Editor of the IEEE Transactions on Knowledge and Data Engineering (TKDE), and the ACM Transactions on Intelligent Systems and Technology (TIST). Dr. Bonchi has served as program co-chair of HT 2017, ICDM 2016, and ECML PKDD 2010.

Francesco Gullo is a research scientist at UniCredit, R&D department. He received his Ph.D. from the University of Calabria (Italy) in 2010. He spent four years at Yahoo Labs, Barcelona, first as a postdoctoral researcher, and then as a research scientist. He served as workshop chair of ICDM’16, and organized several workshops/symposia (MIDAS @ECML-PKDD’16, MultiClust @SDM’14, @KDD’13).

Andreas Nufer is a Senior Consultant at GridSoft AG in Switzerland. He completed his masters in Computer Science at ETH Zurich, and did his bachelors at Ecole Superieure en Sciences Informatiques in France and at Bern University of Applied Science in Switzerland.
