Distance-Constraint Reachability Computation in
Uncertain Graphs
Ruoming Jin, Lin Liu Kent State University
Bolin Ding UIUC
Haixun Wang MSRA
Why Uncertain Graphs?
• Protein-Protein Interaction Networks
– False Positive > 45%
• Social Networks
– Probabilistic Trust/Influence Model
Uncertainty is ubiquitous!
Increasing importance of graph/network data: social networks, biological networks, traffic/transportation networks, peer-to-peer networks.
The probabilistic perspective has been receiving more and more attention recently.
Uncertain Graph Model
[Figure: uncertain graph on vertices s, a, b, c, t with nine edges labeled by existence probabilities 0.5, 0.7, 0.2, 0.6, 0.5, 0.9, 0.4, 0.1, 0.3]
Existence Probability
Edge Independence
• Possible worlds (2^#Edge)
[Figure: two possible worlds, G1 and G2, of the uncertain graph on vertices s, a, b, c, t]
Weight of G2: Pr(G2) = 0.5 * 0.7 * 0.2 * 0.6 * (1-0.5) * (1-0.9) * (1-0.4) * (1-0.1) * (1-0.3) = 0.0007938
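The weight of a possible world can be checked with a few lines of Python. This is a minimal sketch: the helper name `world_probability` is hypothetical, and the split into kept and dropped edges is read off the plain and `(1-p)` factors in the product for G2.

```python
from math import prod

def world_probability(present, absent):
    """Probability of one possible world under edge independence.

    present: existence probabilities of the edges kept in the world
    absent:  existence probabilities of the edges dropped from the world
    """
    return prod(present) * prod(1.0 - p for p in absent)

# Factors of Pr(G2): four edges are kept, five are dropped.
present = [0.5, 0.7, 0.2, 0.6]
absent = [0.5, 0.9, 0.4, 0.1, 0.3]
pr_g2 = world_probability(present, absent)
print(round(pr_g2, 7))
```

Summing this weight over all 2^#Edge worlds gives 1, which is what makes the possible worlds a probability space.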
Distance-Constraint Reachability (DCR) Problem
[Figure: the uncertain graph on vertices s, a, b, c, t, with s marked as Source and t marked as Target]
Given a distance constraint d and two vertices s and t:
• What is the probability that s can reach t within distance d?
• A generalization of the two-terminal network reliability problem, which has no distance constraint.
Important Applications
• Peer-to-Peer (P2P) Networks
– Communication happens only when node distance is limited.
• Social Networks
– Trust/influence can be propagated only through a small number of hops.
• Traffic Networks
– Travel distance (travel time) query
– What is the probability that we can reach the airport within one hour?
Example: Exact Computation
• d = 2: what is the probability that s can reach t within distance 2?
[Figure: the uncertain graph and four of its possible worlds, G1 to G4]
First Step: Enumerate all possible worlds (2^9):
Pr(G1), Pr(G2), Pr(G3), Pr(G4), …
Second Step: Check for distance-constraint connectivity:
= … + Pr(G1) * 0 + Pr(G2) * 1 + Pr(G3) * 0 + Pr(G4) * 1 + …
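The two steps above can be sketched as brute-force enumeration. The code below is an illustration only: the graph `EDGES` is a small hypothetical example (the edge layout of the figure is not recoverable from the slide text), edges are assumed directed, and distance is assumed to be hop count.

```python
from collections import deque
from itertools import product

# Hypothetical uncertain graph: directed edge -> existence probability.
EDGES = {("s", "a"): 0.5, ("a", "t"): 0.6, ("s", "b"): 0.7,
         ("b", "t"): 0.3, ("a", "b"): 0.2}

def reaches_within(edge_list, s, t, d):
    """BFS: True if t is at most d hops from s using only edge_list."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, []).append(v)
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        if dist[u] == d:
            continue  # do not expand beyond the distance constraint
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return False

def exact_dcr(edges, s, t, d):
    """Sum Pr(G) over every possible world G in which s reaches t within d."""
    items = list(edges.items())
    total = 0.0
    for mask in product([True, False], repeat=len(items)):  # first step
        weight = 1.0
        kept = []
        for keep, (edge, p) in zip(mask, items):
            weight *= p if keep else 1.0 - p
            if keep:
                kept.append(edge)
        if reaches_within(kept, s, t, d):  # second step
            total += weight
    return total

print(exact_dcr(EDGES, "s", "t", 2))
```

On this example the two length-2 paths s-a-t and s-b-t share no edge, so the result can be checked by hand as 1 - (1 - 0.5*0.6)(1 - 0.7*0.3) = 0.447. The loop visits 2^#Edge worlds, which is why this only works for very small graphs.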
Approximating Distance-Constraint Reachability Computation
• Hardness
– Two-terminal network reliability is #P-Complete.
– DCR is a generalization.
• Our goal is to approximate through sampling
– Unbiased estimator
– Minimal variance
– Low computational cost
Start from the most intuitive estimators, right?
Direct Sampling Approach
• Sampling Process
– Sample n graphs
– Sample each graph according to edge probability
[Figure: the uncertain graph and one sampled possible world on vertices s, a, b, c, t]
Direct Sampling Approach (Cont’)
• Estimator $\hat{R}_B$
• Unbiased: $E(\hat{R}_B)$ equals the true reachability probability
• Variance
Indicator function: = 1 if s can reach t within d; = 0 otherwise.
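A minimal sketch of direct sampling follows. The estimate is the fraction of sampled worlds in which the indicator is 1; for n independent samples and true probability R, its variance is R(1-R)/n. The graph `EDGES` and the function names are hypothetical, and distance is assumed to be hop count on directed edges.

```python
import random
from collections import deque

# Hypothetical uncertain graph: directed edge -> existence probability.
EDGES = {("s", "a"): 0.5, ("a", "t"): 0.6, ("s", "b"): 0.7,
         ("b", "t"): 0.3, ("a", "b"): 0.2}

def reaches_within(edge_list, s, t, d):
    """Indicator function: True if t is at most d hops from s."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, []).append(v)
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        if dist[u] == d:
            continue
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return False

def direct_sampling(edges, s, t, d, n, rng):
    """Average the indicator over n independently sampled possible worlds."""
    hits = 0
    for _ in range(n):
        # Keep each edge independently with its existence probability.
        world = [e for e, p in edges.items() if rng.random() < p]
        hits += reaches_within(world, s, t, d)
    return hits / n

rng = random.Random(0)
estimate = direct_sampling(EDGES, "s", "t", 2, 20000, rng)
print(estimate)
```

Every sample tosses a coin for every edge, even when reachability is already decided, which is the cost the later estimators try to avoid.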
Path-Based Approach
• Generate Path Set
– Enumerate all paths from s to t with length ≤ d
– Enumeration methods, e.g., DFS
[Figure: the uncertain graph on vertices s, a, b, c, t restricted to the edges that lie on such paths]
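The path enumeration step can be sketched with a depth-bounded DFS. The graph `EDGES` and the function name `paths_within` are hypothetical; paths are assumed to be simple (no repeated vertex) and length is assumed to be hop count.

```python
# Hypothetical uncertain graph: directed edge -> existence probability.
EDGES = {("s", "a"): 0.5, ("a", "t"): 0.6, ("s", "b"): 0.7,
         ("b", "t"): 0.3, ("a", "b"): 0.2}

def paths_within(edges, s, t, d):
    """Return every simple s-t path with at most d edges, as lists of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    found = []

    def dfs(u, path, visited):
        if u == t:
            found.append(list(path))
            return
        if len(path) == d:
            return  # distance budget used up without reaching t
        for v in adj.get(u, []):
            if v not in visited:
                visited.add(v)
                path.append((u, v))
                dfs(v, path, visited)
                path.pop()
                visited.remove(v)

    dfs(s, [], {s})
    return found

paths_d2 = paths_within(EDGES, "s", "t", 2)
paths_d3 = paths_within(EDGES, "s", "t", 3)
print(paths_d2)
```

The number of such paths can grow exponentially with d, so the path set itself may become the bottleneck on dense graphs.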
Path-Based Approach (Cont’)
• Path set
• Reachability is the probability that at least one path in the path set exists
– Exactly computed by the Inclusion-Exclusion principle
– Approximated by the Monte-Carlo algorithm of R. M. Karp and M. G. Luby
• Unbiased
• Variance
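The Monte-Carlo approximation over a path set can be sketched in the style of the Karp-Luby coverage estimator for the probability of a union of events. This is an illustrative version, not necessarily the exact variant used in the talk: `EDGES`, `PATHS`, and `karp_luby` are hypothetical names, and the path set is written out by hand for distance constraint d = 2.

```python
import random
from math import prod

# Hypothetical uncertain graph and its s-t paths of length <= 2.
EDGES = {("s", "a"): 0.5, ("a", "t"): 0.6, ("s", "b"): 0.7,
         ("b", "t"): 0.3, ("a", "b"): 0.2}
PATHS = [[("s", "a"), ("a", "t")], [("s", "b"), ("b", "t")]]

def karp_luby(edges, paths, n, rng):
    """Estimate Pr(at least one path exists) with n coverage samples."""
    weights = [prod(edges[e] for e in path) for path in paths]  # Pr(path i exists)
    total = sum(weights)
    if total == 0.0:
        return 0.0
    hits = 0
    for _ in range(n):
        # Pick a path with probability proportional to its existence probability.
        i = rng.choices(range(len(paths)), weights=weights)[0]
        forced = set(paths[i])
        # Sample a world conditioned on path i being present.
        world = {e for e, p in edges.items() if e in forced or rng.random() < p}
        # Count the sample only if path i is the first path present in the world,
        # so every world containing some path is credited exactly once.
        first = next(j for j, path in enumerate(paths) if world.issuperset(path))
        hits += (first == i)
    return total * hits / n

rng = random.Random(0)
estimate = karp_luby(EDGES, PATHS, 20000, rng)
print(estimate)
```

The estimate is unbiased because each world that contains at least one path contributes its probability through exactly one (path, world) pair.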
Can we do better?
Divide-and-Conquer Methodology
• Example
[Figure: enumeration tree. The root is the uncertain graph on vertices s, a, b, c, t; each node branches on one edge, e.g. +(s,a) / -(s,a) at the first level and +(a,t) / -(a,t), +(s,b) / -(s,b) at the second level, and branching continues until reachability is decided]
1. The number of leaf nodes is smaller than 2^|E|.
2. Each possible world exists in only one leaf node.
3. Reachability is the sum of the weights of the blue nodes.
4. Leaf nodes form a nice sample space.
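The enumeration tree can be sketched as a recursion that includes or excludes one edge at a time and stops as soon as reachability is decided. This is a simplified illustration: `EDGES` and the function names are hypothetical, the next edge is picked in arbitrary dictionary order rather than by any particular edge-selection strategy, and distance is assumed to be hop count.

```python
from collections import deque

# Hypothetical uncertain graph: directed edge -> existence probability.
EDGES = {("s", "a"): 0.5, ("a", "t"): 0.6, ("s", "b"): 0.7,
         ("b", "t"): 0.3, ("a", "b"): 0.2}

def reaches_within(edge_list, s, t, d):
    """True if t is at most d hops from s using only edge_list."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, []).append(v)
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        if dist[u] == d:
            continue
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return False

def dcr_divide_and_conquer(edges, s, t, d, included=(), excluded=()):
    """Reachability conditioned on the included and excluded edge sets."""
    # Leaf where s reaches t: the included edges already contain a short path.
    if reaches_within(included, s, t, d):
        return 1.0
    # Leaf where s cannot reach t: even keeping every undecided edge fails.
    candidates = [e for e in edges if e not in excluded]
    if not reaches_within(candidates, s, t, d):
        return 0.0
    # Otherwise branch on one undecided edge: +e with p(e), -e with 1 - p(e).
    e = next(x for x in candidates if x not in included)
    p = edges[e]
    return (p * dcr_divide_and_conquer(edges, s, t, d, included + (e,), excluded)
            + (1.0 - p) * dcr_divide_and_conquer(edges, s, t, d, included, excluded + (e,)))

result = dcr_divide_and_conquer(EDGES, "s", "t", 2)
print(result)
```

Multiplying branch probabilities on the way back up is equivalent to summing the weights of the leaves where s reaches t, and early termination is what keeps the number of leaves below 2^|E|.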
Divide and Conquer (Cont’)
Summarize:
[Figure: enumeration tree over all possible worlds; the root splits into graphs having e1 and graphs not having e1; leaves are marked either "s can reach t" or "s cannot reach t"]
How do we sample?
• Unequal probability sampling
– Hansen-Hurwitz (HH) estimator
– Horvitz-Thompson (HT) estimator
[Figure: enumeration tree with the root labeled "Start from here" and a leaf node labeled "Sample Unit"]
• Pri: sample unit weight; the sum of the possible worlds’ probabilities in the node.
• qi: sampling probability, determined by the properties of the coins along the way.
Hansen-Hurwitz (HH) Estimator
• Estimator
• Unbiased
• Variance
To minimize the variance above, we have: Pri = qi
Indicator: = 1, blue node; = 0, red node
[Figure: enumeration tree whose branches are labeled P(e1) / 1-P(e1), P(e2) / 1-P(e2), P(e3) / 1-P(e3), P(e4) / 1-P(e4)]
• Weight: Pri = p(e1)*p(e2)*(1-p(e3))*…
• Sampling probability: toss a coin at each branch with odds p(e1) : 1 – p(e1), p(e2) : 1 – p(e2), p(e3) : 1 – p(e3)
• n: sample size
• Pri: the leaf node weight
• qi: the sampling probability
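A minimal sketch of leaf sampling with the Hansen-Hurwitz estimator follows. In its standard form the estimator averages Pri * indicator / qi over the n sampled leaves; when each coin uses the edge probability itself, qi = Pri and the estimate reduces to the fraction of sampled leaves where s reaches t. `EDGES` and the function names are hypothetical, and the edge order is arbitrary.

```python
import random
from collections import deque

# Hypothetical uncertain graph: directed edge -> existence probability.
EDGES = {("s", "a"): 0.5, ("a", "t"): 0.6, ("s", "b"): 0.7,
         ("b", "t"): 0.3, ("a", "b"): 0.2}

def reaches_within(edge_list, s, t, d):
    """True if t is at most d hops from s using only edge_list."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, []).append(v)
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        if dist[u] == d:
            continue
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return False

def sample_leaf(edges, s, t, d, rng):
    """Walk from the root to one leaf; return 1 if s reaches t there, else 0."""
    included, excluded = [], set()
    while True:
        if reaches_within(included, s, t, d):
            return 1
        candidates = [e for e in edges if e not in excluded]
        if not reaches_within(candidates, s, t, d):
            return 0
        e = next(x for x in candidates if x not in included)
        # Coin with odds p(e) : 1 - p(e), so that qi equals Pri at the leaf.
        if rng.random() < edges[e]:
            included.append(e)
        else:
            excluded.add(e)

def hh_estimate(edges, s, t, d, n, rng):
    """HH estimate with qi = Pri: the mean indicator over n sampled leaves."""
    return sum(sample_leaf(edges, s, t, d, rng) for _ in range(n)) / n

rng = random.Random(7)
estimate = hh_estimate(EDGES, "s", "t", 2, 20000, rng)
print(estimate)
```

Compared with sampling whole possible worlds, each walk stops tossing coins as soon as the leaf is decided.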
Horvitz-Thompson (HT) Estimator
• Estimator
– Sums over the # of unique sample units
• Unbiased
• Variance
– To minimize variance, we find Pri = qi
– Smaller variance than HH estimator
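A sketch of the Horvitz-Thompson estimator over the same leaf sample space is given below. It uses the standard form: each distinct sampled leaf contributes Pri * indicator / πi, where πi = 1 - (1 - qi)^n is the probability that leaf i appears at least once in n draws. The code assumes qi = Pri; `EDGES` and the function names are hypothetical.

```python
import random
from collections import deque

# Hypothetical uncertain graph: directed edge -> existence probability.
EDGES = {("s", "a"): 0.5, ("a", "t"): 0.6, ("s", "b"): 0.7,
         ("b", "t"): 0.3, ("a", "b"): 0.2}

def reaches_within(edge_list, s, t, d):
    """True if t is at most d hops from s using only edge_list."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, []).append(v)
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        if dist[u] == d:
            continue
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return False

def sample_leaf(edges, s, t, d, rng):
    """Return (leaf key, leaf weight Pri, indicator) for one sampled leaf."""
    included, excluded, decisions, weight = [], set(), [], 1.0
    while True:
        if reaches_within(included, s, t, d):
            return tuple(decisions), weight, 1
        candidates = [e for e in edges if e not in excluded]
        if not reaches_within(candidates, s, t, d):
            return tuple(decisions), weight, 0
        e = next(x for x in candidates if x not in included)
        p = edges[e]
        keep = rng.random() < p
        decisions.append(keep)  # the coin outcomes identify the leaf
        weight *= p if keep else 1.0 - p
        if keep:
            included.append(e)
        else:
            excluded.add(e)

def ht_estimate(edges, s, t, d, n, rng):
    """HT estimate: sum over unique sampled leaves of Pri * I / pi_i."""
    leaves = {}
    for _ in range(n):
        key, weight, hit = sample_leaf(edges, s, t, d, rng)
        leaves[key] = (weight, hit)  # duplicates collapse to one unit
    return sum(w * hit / (1.0 - (1.0 - w) ** n) for w, hit in leaves.values())

rng = random.Random(3)
estimate = ht_estimate(EDGES, "s", "t", 2, 2000, rng)
print(estimate)
```

On a graph this small nearly every leaf is drawn at least once, so the estimate sits very close to the exact value; the interesting regime is a tree with far more leaves than samples.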
Can we further reduce the variance and computational
cost?
Recursive Estimator
[Figure: enumeration tree split at the root into two sub-spaces]
• Sample the entire space n times
• Sample the sub-space n1 times
• Sample the sub-space n2 times
• n1 + n2 = n
1. Unbiased
2. Variance: We cannot minimize the variance without knowing τ1 and τ2. Then what can we do?
Sample Allocation
• We guess: What if
– n1 = n*p(e)
– n2 = n*(1-p(e))?
• We find: Variance reduced!
– HH Estimator:
– HT Estimator:
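One way to see why this allocation reduces variance, for the HH case, is the following one-level derivation. It is a sketch under stated assumptions, not the formulas from the slide: p stands for p(e), τ1 and τ2 are taken to be the reachability inside the two sub-spaces, each sample is an independent 0/1 draw (qi = Pri), and n*p is treated as an integer.

```latex
\begin{align*}
R &= p\,\tau_1 + (1-p)\,\tau_2, \\
\operatorname{Var}\bigl(\hat{R}_{\text{direct}}\bigr)
  &= \frac{R(1-R)}{n}
   = \frac{p\,\tau_1(1-\tau_1) + (1-p)\,\tau_2(1-\tau_2) + p(1-p)(\tau_1-\tau_2)^2}{n}, \\
\operatorname{Var}\bigl(\hat{R}_{\text{rec}}\bigr)
  &= p^2\,\frac{\tau_1(1-\tau_1)}{n_1} + (1-p)^2\,\frac{\tau_2(1-\tau_2)}{n_2}
   = \frac{p\,\tau_1(1-\tau_1) + (1-p)\,\tau_2(1-\tau_2)}{n}
   \quad\text{for } n_1 = np,\; n_2 = n(1-p).
\end{align*}
```

The difference between the two variances is p(1-p)(τ1-τ2)^2 / n, which is never negative, so under these assumptions the allocated estimator is at least as good without knowing τ1 or τ2.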
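Sample Allocation (Cont’)
• Sampling Time Reduced!!
– Directly allocate samples
– Toss coin when sample size is small
[Figure: enumeration tree annotated with sample sizes: sample size = n at the root; n1=n*p(e1) and n2=n*(1-p(e1)) at the first level; n3=n1*p(e2) and n4=n1*(1-p(e2)) below n1]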
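The allocation scheme can be sketched as a recursive estimator that splits the sample budget at each branch and falls back to coin tossing once the budget is small. This is an illustration under assumptions: `EDGES`, the function names, the `small` threshold of 5, and the rounding rule are all hypothetical choices, and the fallback corresponds to the HH case with qi = Pri.

```python
import random
from collections import deque

# Hypothetical uncertain graph: directed edge -> existence probability.
EDGES = {("s", "a"): 0.5, ("a", "t"): 0.6, ("s", "b"): 0.7,
         ("b", "t"): 0.3, ("a", "b"): 0.2}

def reaches_within(edge_list, s, t, d):
    """True if t is at most d hops from s using only edge_list."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, []).append(v)
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        if dist[u] == d:
            continue
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return False

def walk(edges, s, t, d, included, excluded, rng):
    """Toss coins from a tree node down to a leaf; return its indicator."""
    included, excluded = list(included), set(excluded)
    while True:
        if reaches_within(included, s, t, d):
            return 1
        candidates = [e for e in edges if e not in excluded]
        if not reaches_within(candidates, s, t, d):
            return 0
        e = next(x for x in candidates if x not in included)
        if rng.random() < edges[e]:
            included.append(e)
        else:
            excluded.add(e)

def recursive_estimate(edges, s, t, d, n, rng, included=(), excluded=(), small=5):
    """Estimate reachability below a tree node with a budget of n >= 1 samples."""
    if reaches_within(included, s, t, d):
        return 1.0
    candidates = [e for e in edges if e not in excluded]
    if not reaches_within(candidates, s, t, d):
        return 0.0
    e = next(x for x in candidates if x not in included)
    p = edges[e]
    n1 = round(n * p)  # samples allocated to the +e branch
    n2 = n - n1        # samples allocated to the -e branch
    if n <= small or n1 == 0 or n2 == 0:
        # Budget too small to split: toss coins for the remaining samples.
        return sum(walk(edges, s, t, d, included, excluded, rng) for _ in range(n)) / n
    r1 = recursive_estimate(edges, s, t, d, n1, rng, included + (e,), excluded, small)
    r2 = recursive_estimate(edges, s, t, d, n2, rng, included, excluded + (e,), small)
    return p * r1 + (1.0 - p) * r2

rng = random.Random(11)
estimate = recursive_estimate(EDGES, "s", "t", 2, 2000, rng)
print(estimate)
```

Branches that receive a budget are explored once instead of once per sample, which is where the saving in sampling time comes from.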
Experimental Setup
• Experiment setting
– Goal: relative error, variance, computational time
– System specification: 2.0GHz Dual Core AMD Opteron CPU, 4.0GB RAM, Linux
Experimental Results
• Synthetic datasets
– Erdős-Rényi random graphs
– Vertex#: 5000, edge density: 10, sample size: 1000
– Categorized by extracted-subgraph size (#edge)
– For each category, 1000 queries
Experimental Results
• Real datasets
– DBLP: 226,000 vertices, 1,400,000 edges
– Yeast PPIN: 5499 vertices, 63796 edges
– Fly PPIN: 7518 vertices, 51660 edges
– Extracted subgraph size: 20 ~ 50 edges
Conclusions
• We propose a novel s-t distance-constraint reachability problem in uncertain graphs.
• An efficient exact computation algorithm is developed based on a divide-and-conquer scheme.
• Compared with two classic reachability estimators, we introduce two unequal probability sampling estimators: the Hansen-Hurwitz (HH) estimator and the Horvitz-Thompson (HT) estimator.
• Based on the enumeration tree framework, two recursive estimators, Recursive HH and Recursive HT, are constructed to reduce estimation variance and time.
• Experiments demonstrate the accuracy and efficiency of our estimators.
Thank you! Questions?