Efficiently Answering Reachability Queries on Large Directed Graphs Ruoming Jin Kent State University Joint work with Yang Xiang (KSU), Ning Ruan (KSU), and Haixun Wang (IBM T.J. Watson)

Dec 21, 2015

Page 1:

Efficiently Answering Reachability Queries on Large Directed Graphs

Ruoming Jin, Kent State University

Joint work with Yang Xiang (KSU), Ning Ruan (KSU), and Haixun Wang (IBM T.J. Watson)

Page 2:

Reachability Query

[Figure: example directed graph on vertices 1 to 15]

Query(1,11)? Yes

Query(3,9)? No

The problem: Given two vertices u and v in a directed graph G, is there a path from u to v ?

Directed graph → DAG (directed acyclic graph), obtained by coalescing the strongly connected components
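As a point of reference, the problem statement can be answered without any index by an online search. The sketch below is a plain breadth-first search; the adjacency dictionary `succ` and the small example graph are hypothetical, since the edges of the slide's figure are not recoverable from the transcript.

```python
from collections import deque

def reachable(succ, u, v):
    """Return True if there is a directed path from u to v.

    succ maps a vertex to a list of its direct successors
    (hypothetical representation, not taken from the slides).
    """
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in succ.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

# Hypothetical toy graph: 5 -> 1, 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4
example = {5: [1], 1: [2, 3], 2: [4], 3: [4]}
```

Each query costs O(n+m) in the worst case, which is the baseline that the indexing schemes in the rest of the deck try to beat.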

Page 3:

Applications

• XML

• Biological networks

• Ontology

• Knowledge representation (Lattice operation)

• Object-oriented programming (class relationships)

• Distributed systems (Reachable states)

Graph Databases

Page 4:

Prior Work

Method | Query time | Construction | Index size
DFS/BFS | O(n+m) | O(n+m) | O(n+m)
Transitive Closure | O(1) | O(nm)/O(n^3) | O(n^2)
Optimal Chain Cover (Jagadish, TODS’90) | O(k) | O(nm) | O(nk)
Optimal Tree Cover (Agrawal et al., SIGMOD’89) | O(n) | O(nm) | O(n^2)
Dual-Labeling (Wang et al., ICDE’06) | O(1) | O(n+m+t^3) | O(n+t^2)
Labeling+SSPI (Chen et al., VLDB’05) | O(m-n) | O(n+m) | O(n+m)
GRIPP (Trißl et al., SIGMOD’07) | O(m-n) | O(n+m) | O(n+m)

2-HOP (O(nm^{1/2}) and O(n^4)), HOPI, and heuristic algorithms

Page 5:

Limitation of Tree-based approaches

• Finding a good tree cover is expensive

• Tree cover cannot represent some common types of DAGs, like Grid

• Compression limitations
– Chain (1-parent, 1-child)
– Tree (1-parent, multiple children)
– Most existing methods which utilize the tree cover are greatly affected by how many edges are left uncovered

Page 6:

Overview of Path-Tree

• Chain->Tree->Path-Tree (2 parents / multiple children)

• Path-tree cover is a spanning subgraph of G in a tree shape (T)

• A node in the tree T corresponds to a path in G and an edge in T corresponds to the edges between two paths in G

• A 3-tuple labeling exists for any path-tree and answers reachability queries in O(1)

Page 7:

Path-Tree in a Nutshell

[Figure: the example DAG on vertices 1 to 15 partitioned into paths P1, P2, P3, P4, shown next to the corresponding path-tree on P1 to P4]

Path-Graph is not necessarily a planar graph.

The reachability between any two nodes can be answered in O(1).

Page 8:

Key Problems

• How to construct a path-tree?
– Algorithm

• How can a path-tree help with reachability queries?
– Labeling
– Transitive Closure Compression

• How does path-tree compare with the existing methods?
– Optimality

Page 9:

Constructing Path-Tree

• Step 1: Path-Decomposition of DAG

• Step 2: Minimal Equivalent Edge Set between any two paths

• Step 3: Path-Graph Construction

• Step 4: Path-Tree Cover Extraction

Page 10:

Step 1: Path-Decomposition

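[Figure: the example DAG on vertices 1 to 15 decomposed into paths P1, P2, P3, P4; one vertex is annotated with (PID, SID) = (2, 5)]

Each vertex is identified by a (PID, SID) pair: its path ID and its sequence ID within that path.

For any two nodes u and v in the same path, u → v if and only if u.sid ≤ v.sid.

A simple linear-time algorithm based on topological sort can produce a path-decomposition.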
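The slide does not spell out the decomposition procedure, so the following is only one plausible sketch of a linear-time, topological-sort-based approach: vertices are scanned in topological order, each still-unassigned vertex starts a new path, and the path is extended greedily through unassigned successors. The function names, the 0-based `pid`/`sid` numbering, and the tie-breaking rule (earliest successor in topological order) are assumptions.

```python
from collections import deque

def topological_order(n, succ):
    """Kahn's algorithm on vertices 0..n-1; raises if the graph has a cycle."""
    indeg = [0] * n
    for u in range(n):
        for v in succ[u]:
            indeg[v] += 1
    queue = deque(u for u in range(n) if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != n:
        raise ValueError("input is not a DAG")
    return order

def path_decomposition(n, edges):
    """Split a DAG into vertex-disjoint paths.

    Returns (paths, pid, sid), where pid[v] is the path index of v and
    sid[v] its position inside that path (both 0-based here).
    """
    succ = [[] for _ in range(n)]
    for u, v in edges:
        succ[u].append(v)
    order = topological_order(n, succ)
    rank = {v: i for i, v in enumerate(order)}
    pid, sid, paths = [-1] * n, [-1] * n, []
    for start in order:
        if pid[start] != -1:
            continue  # already placed on an earlier path
        path, u = [], start
        while u is not None:
            pid[u], sid[u] = len(paths), len(path)
            path.append(u)
            # Assumed tie-break: extend to the unassigned successor that
            # comes first in topological order.
            free = [v for v in succ[u] if pid[v] == -1]
            u = min(free, key=rank.__getitem__) if free else None
        paths.append(path)
    return paths, pid, sid
```

Every successor list is scanned once, so the decomposition itself runs in O(n+m) after the topological sort. Within one path, the same-path rule from the slide then reduces to comparing `sid` values.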

Page 11:

Step 2: Minimal equivalent edge set

[Figure: paths P1 and P2 with all edges between them, and the same two paths with only the minimal equivalent edge set]

The reachability between any two paths can be captured by a unique minimal set of edges.

The edges in the minimal equivalent edge set do not cross (they are always parallel)!

Page 12:

Step 3: Path-Graph Construction

[Figure: the example DAG with paths P1 to P4 and the resulting path-graph on P1 to P4; numeric edge labels as extracted: 4, 5, 12, 2, 11, 2]

Weighted Directed Path-Graph

Weight reflects the cost we have to pay for the transitive closure computation if we exclude this path-tree edge

Page 13:

Step 4: Extracting Path-Tree Cover

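[Figure: the weighted directed path-graph on P1 to P4 (numeric edge labels as extracted: 4, 5, 12, 2, 11, 2) and the spanning tree extracted from it (numeric edge labels as extracted: 5, 2, 2)]

Weighted Directed Path-Graph → Maximum Directed Spanning Tree

The path-tree cover is extracted with the Chu-Liu/Edmonds algorithm in O(m' + k log k), where k is the number of paths and m' is the number of edges in the path-graph.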
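To illustrate the extraction step, here is a minimal sketch of the textbook contraction form of Chu-Liu/Edmonds. It is the simple O(k·m') variant rather than the O(m' + k log k) one cited on the slide, it returns only the total weight of the maximum spanning arborescence, and it assumes a known root path. The function name, the `root` parameter, and the example weights are hypothetical and are not the weights of the slide's figure.

```python
def max_arborescence_weight(n, root, edges):
    """Total weight of a maximum spanning arborescence rooted at `root`.

    edges is a list of (u, v, w) with vertices 0..n-1. Returns None if
    some vertex cannot be reached from the root.
    """
    # Negate weights so the classic minimisation formulation applies.
    edges = [(u, v, -w) for u, v, w in edges]
    inf = float("inf")
    total = 0
    while True:
        # 1. Cheapest incoming edge for every vertex.
        in_w = [inf] * n
        pre = [-1] * n
        for u, v, w in edges:
            if u != v and w < in_w[v]:
                in_w[v], pre[v] = w, u
        if any(v != root and in_w[v] == inf for v in range(n)):
            return None
        in_w[root] = 0
        # 2. Detect cycles formed by the chosen edges.
        comp = [-1] * n
        vis = [-1] * n
        cnt = 0
        for v in range(n):
            total += in_w[v]
            x = v
            while vis[x] != v and comp[x] == -1 and x != root:
                vis[x] = v
                x = pre[x]
            if x != root and comp[x] == -1:
                y = pre[x]
                while y != x:
                    comp[y] = cnt
                    y = pre[y]
                comp[x] = cnt
                cnt += 1
        if cnt == 0:
            return -total  # no cycle left: chosen edges form the tree
        # 3. Contract each cycle into one vertex and adjust edge weights.
        for v in range(n):
            if comp[v] == -1:
                comp[v] = cnt
                cnt += 1
        edges = [(comp[u], comp[v], w - in_w[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root, n = comp[root], cnt

# Hypothetical path-graph on four paths, with path 0 as the root.
example_edges = [(0, 1, 5), (0, 2, 1), (1, 2, 4),
                 (2, 1, 12), (2, 3, 2), (1, 3, 11)]
```

In the example, the two heaviest edges into paths 1 and 2 form a cycle, so the algorithm contracts it and trades one of them for an edge from the root. Recovering the chosen edges, which the path-tree cover needs, requires extra bookkeeping during contraction that this sketch omits.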

Page 14:

Key Problems

• How to construct a path-tree?
– Algorithm

• How can a path-tree help with reachability queries?
– Labeling
– Transitive Closure Compression

• How does path-tree compare with the existing methods?
– Optimality

Page 15:

3-Tuple Labeling for Reachability

[Figure: the example DAG with paths P1 to P4 and its path-tree, annotated with the interval labels [1,1], [2,2], [1,3], [1,4]]

DFS labeling (1-tuple)

Interval labeling (2-tuple): a high-level description of the paths, used to decide whether Pi → Pj

Page 16:

DFS labeling

[Figure: the path-tree cover of the example DAG with paths P1 to P4; the DFS labels 1 to 15 are shown next to the vertices]

1. Start from the first vertex in the root path.
2. Always try to visit the next vertex in the same path first.
3. Label a node once all its neighbors have been visited: L(v) = N - x, where x is the number of nodes that have already been labeled.
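The numbering rule L(v) = N - x can be sketched as a depth-first search that labels a vertex when it finishes. This is only an illustration of the rule: the order in which vertices on other paths are visited, and the handling of vertices not reachable from the start vertex, are assumptions and may differ from the ordering the authors rely on. The adjacency lists in `succ` are assumed to list the same-path successor first.

```python
def dfs_labels(n, succ, start):
    """Assign labels N - x in DFS finishing order.

    succ[u] lists the successors of u, with the next vertex on u's own
    path placed first (assumed convention). Vertices are 0..n-1 and the
    search starts at `start`, the first vertex of the root path.
    Uses recursion, so it is only suitable for small graphs as written.
    """
    label = [0] * n
    visited = [False] * n
    labeled = 0  # x: number of vertices labeled so far

    def visit(u):
        nonlocal labeled
        visited[u] = True
        for v in succ[u]:
            if not visited[v]:
                visit(v)
        label[u] = n - labeled  # all neighbors of u are done
        labeled += 1

    visit(start)
    for u in range(n):  # assumption: sweep up anything not yet reached
        if not visited[u]:
            visit(u)
    return label
```

Because a vertex is labeled only after all of its successors, every edge u → v ends up with L(u) < L(v), which is the property the second condition of the reachability test depends on.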

Page 17:

3-Tuple Labeling for Reachability

[Figure: the path-tree cover with DFS labels 1 to 15, and the path-tree on P1 to P4 with interval labels [1,1], [2,2], [1,3], [1,4]]

u → v if and only if
1) Interval labels: I(u) ⊇ I(v)
2) DFS labels: L(u) ≤ L(v)

Query(9,15)? P4[1,4] ⊇ P1[1,1] and 5 < 15: Yes

Query(9,2)?

Query(5,9)?
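The query test can be written in a few lines. The sketch reads the two conditions as I(v) ⊆ I(u) and L(u) ≤ L(v), which is the reading consistent with the slide's Query(9,15) example (P4 has interval [1,4], P1 has interval [1,1], and the DFS labels are 5 and 15). The `Label` type and its field names are hypothetical.

```python
from typing import NamedTuple

class Label(NamedTuple):
    lo: int   # start of the interval label of the vertex's path
    hi: int   # end of the interval label of the vertex's path
    dfs: int  # DFS label L(v) of the vertex

def reaches(u: Label, v: Label) -> bool:
    """3-tuple test: v's interval is contained in u's, and L(u) <= L(v)."""
    return u.lo <= v.lo and v.hi <= u.hi and u.dfs <= v.dfs

# Values taken from the slide's Query(9,15) example.
vertex_9 = Label(lo=1, hi=4, dfs=5)
vertex_15 = Label(lo=1, hi=1, dfs=15)
```

With these labels `reaches(vertex_9, vertex_15)` is true, matching the slide, and the test is three integer comparisons, which is where the O(1) query time comes from.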

Page 18:

Transitive Closure Compression

An efficient procedure can compute and compress the transitive closure in O(mk), where k is the number of paths in the path-tree.

[Figure: the example DAG on vertices 1 to 15]

The path-tree cover (including the labeling) can be constructed in O(m + n log n).
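The slide does not show the procedure, so the sketch below illustrates only the simpler idea of compressing the closure with a path-decomposition alone. For each vertex it keeps, per path, the earliest position on that path that the vertex can reach. It ignores the further savings from the path-tree labels, so it is not the authors' full method. The function and variable names are assumptions.

```python
from collections import deque

def compress_closure(n, edges, pid, sid):
    """first[u][p] = smallest sid on path p reachable from u (inf if none).

    pid[v] / sid[v] give the path index and position of v, as produced
    by some path-decomposition of the DAG (0-based here).
    """
    k = max(pid) + 1
    succ = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Topological order via Kahn's algorithm.
    queue = deque(u for u in range(n) if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    inf = float("inf")
    first = [[inf] * k for _ in range(n)]
    # Process sinks first so successors are complete before they are merged.
    for u in reversed(order):
        first[u][pid[u]] = sid[u]
        for v in succ[u]:
            for p in range(k):  # O(k) work per edge, O(mk) overall
                if first[v][p] < first[u][p]:
                    first[u][p] = first[v][p]
    return first

def reaches(first, pid, sid, u, v):
    """u reaches v iff u reaches some vertex at or before v on v's path."""
    return first[u][pid[v]] <= sid[v]
```

The merge loop does O(k) work per edge, which matches the O(mk) bound quoted above. The index holds n·k entries instead of up to n² closure entries.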

Page 19:

Key Problems

• How to construct a path-tree?
– Algorithm

• How can a path-tree help with reachability queries?
– Labeling
– Transitive Closure Compression

• How does path-tree compare with the existing methods?
– Optimality

Page 20:

Theoretical Analysis

• Optimal Path-Tree Cover (OPTC) Problem:
– Given a path-decomposition, what is the optimal path-tree cover to maximally compress the transitive closure?
– OptIndex weight assignment based on computing the predecessor set

• Optimal Path-Decomposition (OPD) Problem:
– Assuming we only use path-decomposition to compress the transitive closure, what is the optimal path-decomposition to maximally compress the transitive closure?
– Minimum-cost flow problem
– What is the overall optimal path-decomposition?

Page 21:

Superiority of Path-Tree Cover

• The optimal tree cover is a special case of path-tree cover when each vertex corresponds to a single path and the weight is based on OptIndex.

• The path-tree cover approach can compress the transitive closure to a size smaller than or equal to that of the optimal tree cover approach (and consequently of the optimal chain cover approach).

Page 22:

Experimental Evaluation

• Implementation in C++

• 12 Real datasets used in Dual-labeling paper and GRIPP paper

• Synthetic datasets
– Sparse DAG with edge density = 2

• AMD Opteron 2.0GHz / 2GB / Linux

• PTree1 (OptIndex) and PTree2
– Mainly compare with Optimal Tree Cover

Page 23:

Real Datasets

Graph Name | #V | #E | DAG #V | DAG #E
AgroCyc | 13969 | 17694 | 12684 | 13408
aMaze | 11877 | 28700 | 3710 | 3600
Anthra | 13736 | 17307 | 12499 | 13104
Ecoo157 | 13800 | 17308 | 12620 | 13350
HpyCyc | 5565 | 8474 | 4771 | 5859
Human | 40051 | 43879 | 38811 | 39576
Kegg | 14271 | 35170 | 3617 | 3908
Mtbrv | 10697 | 13922 | 9602 | 10245
Nasa | 5704 | 7942 | 5605 | 7735
Reactome | 3678 | 14447 | 901 | 846
Vchocyc | 10694 | 14207 | 9491 | 10143
Xmark | 6483 | 7654 | 6080 | 7028

Page 24:

Experimental Result (Real Data)

Columns: Transitive Closure Size (TC), Construction Time in ms (CT), and Query Time in ms (QT), each for Tree, Ptree-1, and Ptree-2.

Graph | TC Tree | TC Ptree-1 | TC Ptree-2 | CT Tree | CT Ptree-1 | CT Ptree-2 | QT Tree | QT Ptree-1 | QT Ptree-2
AgroCyc | 13550 | 962 | 2133 | 149.8 | 224.853 | 142.311 | 46.629 | 10 | 14.393
aMaze | 5178 | 1571 | 17274 | 1062.2 | 834.697 | 63.748 | 19.478 | 21.529 | 61.925
Anthra | 13155 | 733 | 2620 | 141.11 | 212.258 | 143.568 | 44.958 | 9.317 | 16.498
Ecoo157 | 13493 | 973 | 3592 | 151.46 | 229.29 | 141.951 | 46.674 | 11.224 | 16.739
HpyCyc | 5946 | 4224 | 4661 | 57.378 | 106.552 | 71.675 | 31.539 | 12.089 | 15.503
Human | 39636 | 965 | 2910 | 446.32 | 648.005 | 465.148 | 70.107 | 20.008 | 23.008
Kegg | 5121 | 1703 | 30344 | 746.03 | 1057.11 | 86.396 | 17.509 | 27.282 | 75.448
Mtbrv | 10288 | 812 | 3664 | 111.48 | 173.382 | 106.583 | 40.391 | 9.81 | 19.815
Nasa | 9162 | 5063 | 6670 | 85.291 | 111.397 | 53.139 | 37.037 | 16.214 | 20.771
Reactome | 1293 | 383 | 1069 | 17.244 | 18.189 | 6.3 | 17.565 | 6.467 | 13.037
Vchocyc | 10183 | 830 | 2262 | 109.47 | 170.714 | 103.036 | 40.026 | 8.999 | 14.274
Xmark | 8237 | 2356 | 10614 | 204.76 | 247.628 | 68.358 | 37.834 | 17.122 | 41.549

Callouts on the slide:

On average 10 times better than Tree

On average 3 times better than Tree

Page 25:

Experimental Result (Synthetic Data)

[Figure: Transitive Closure Size. x-axis: # of Vertices in K (DAG), 10 to 100; y-axis: # of Vertices (TC), 0 to 180000; series: Tree, Ptree-1, Ptree-2]

Page 26:

Experimental Result (Synthetic Data)

[Figure: Construction Time. x-axis: # of Vertices in K (DAG), 10 to 100; y-axis: Construction Time in ms, 0 to 1600; series: Tree, Ptree-1, Ptree-2]

Page 27:

Experimental Result (Synthetic Data)

[Figure: Query Time. x-axis: # of Vertices in K (DAG), 10 to 100; y-axis: Query Time in ms, 0 to 90; series: Tree, Ptree-1, Ptree-2]

Page 28:

Conclusion

• A novel path-tree structure is proposed to help compress the transitive closure and answer reachability queries

• Path-tree has the potential to be integrated with other existing methods to further improve the efficiency of reachability query processing

Page 29:

Thanks!!

Page 30:

Step 3: Path-Graph Construction

[Figure: the example DAG with paths P1 to P4 and the resulting path-graph on P1 to P4; numeric edge labels as extracted: 4, 5, 12, 2, 11, 2]

Weighted Directed Path-Graph

Weight reflects the penalty if we exclude this path-tree edge

Page 31:

Step 2: Constructing Minimal Equivalent Edge Set (Pi → Pj)

[Figure: paths P1 and P2 with the edges between them, before and after reduction to the minimal equivalent edge set]

1. Order the vertices in Pi and Pj in decreasing order.
2. Find the first vertex v in Pj that Pi can reach.
3. Find the last vertex u in Pi that reaches v.
4. Remove all the edges that cross (u, v), and repeat steps 2-4.
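One way to sketch the reduction is as a single sorted sweep rather than the repeated find-and-remove loop of the steps above. The sketch assumes an edge from position a on Pi to position b on Pj is redundant whenever another edge leaves Pi at or after a and enters Pj at or before b, because the path edges cover the rest. The function name and the representation of edges as (position on Pi, position on Pj) pairs are hypothetical.

```python
def minimal_equivalent_edges(cross_edges):
    """Reduce the edges from path Pi to path Pj to a minimal equivalent set.

    cross_edges: iterable of (a, b) pairs, one per edge u -> v with u at
    position a on Pi and v at position b on Pj (positions grow along the
    path direction).
    """
    kept = []
    latest_source = None  # largest source position among kept edges
    # Earliest target first; for equal targets, latest source first.
    for a, b in sorted(set(cross_edges), key=lambda e: (e[1], -e[0])):
        # An edge survives only if it leaves Pi strictly later than every
        # edge that already reaches an earlier-or-equal point of Pj.
        if latest_source is None or a > latest_source:
            kept.append((a, b))
            latest_source = a
    return kept
```

For example, `minimal_equivalent_edges([(1, 3), (2, 2), (3, 4), (1, 1), (3, 5)])` keeps `(1, 1)`, `(2, 2)` and `(3, 4)`. The surviving edges have strictly increasing positions on both paths, which is the non-crossing property stated for Step 2.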

Page 32:

3-Tuple Labeling for Reachability

[Figure: the example DAG with paths P1 to P4 and its path-tree, annotated with the interval labels [1,1], [2,2], [1,3], [1,4]]

DFS labeling (1-tuple)

Interval labeling (2-tuple): a high-level description of the paths, used to decide whether Pi → Pj