1 gSpan: Graph-based substructure pattern mining Authors: Xifeng Yan and Jiawei Han Presented by: Colin Luther

Dec 14, 2015



gSpan: Graph-based substructure pattern mining

Authors: Xifeng Yan and Jiawei Han

Presented by: Colin Luther


Copyright note:

This presentation was originally provided by Prof. Xifeng Yan upon request from a student.

Citation: Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In IEEE International Conference on Data Mining (ICDM), 2002.


Outline

Background
Problem Definition
Authors' Contribution
Concepts behind gSpan
Experimental Result
Conclusion


Background

Frequent Subgraph Mining is an extension to existing frequent pattern mining algorithms

A major challenge is to count how many instances of a pattern are in the dataset

Counting instances might be easy for sets, but subtle for graphs

Recall the graph isomorphism problem


Background

[Figure: two graphs (a) and (b), each with five vertices labeled U, V, W, X, Y and numbered 1 to 5]

Two isomorphic graphs (a) and (b) with their mapping function (c)

Two graphs are isomorphic if one can find a mapping of nodes of the first graph to the second graph such that labels on nodes and edges are preserved.

G1 = (V1, E1, L1), G2 = (V2, E2, L2)

(c) f(V1.1) = V2.2, f(V1.2) = V2.5, f(V1.3) = V2.3, f(V1.4) = V2.4, f(V1.5) = V2.1
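The definition of isomorphism above can be checked by brute force on small graphs. The sketch below is only an illustration, not part of gSpan: it tries every vertex permutation and keeps the first one that preserves vertex and edge labels. The graph encoding, the function name `find_isomorphism`, and the example edges are assumptions (the figure's exact edges are not recoverable from the transcript); only the mapping f is taken from the slide.

```python
from itertools import permutations

def find_isomorphism(g1, g2):
    """Return a label-preserving vertex mapping from g1 to g2, or None.

    A graph is a pair (vlabel, elabel): vlabel maps vertex -> label and
    elabel maps frozenset({u, v}) -> edge label (undirected, simple).
    """
    vlabel1, elabel1 = g1
    vlabel2, elabel2 = g2
    if len(vlabel1) != len(vlabel2) or len(elabel1) != len(elabel2):
        return None
    nodes1 = list(vlabel1)
    for image in permutations(vlabel2):
        f = dict(zip(nodes1, image))
        # Vertex labels must be preserved by the mapping.
        if any(vlabel1[v] != vlabel2[f[v]] for v in nodes1):
            continue
        # Every edge of g1 must map to an edge of g2 with the same label.
        if all(elabel2.get(frozenset(f[v] for v in edge)) == label
               for edge, label in elabel1.items()):
            return f
    return None

# Hypothetical example: vertex labels and edges are made up for illustration.
vlabel1 = {1: "X", 2: "W", 3: "U", 4: "Y", 5: "V"}
elabel1 = {frozenset(e): "-" for e in [(1, 2), (2, 3), (3, 4), (4, 5), (1, 3)]}

# Build the second graph by renumbering the first with the slide's mapping f.
slide_f = {1: 2, 2: 5, 3: 3, 4: 4, 5: 1}
vlabel2 = {slide_f[v]: lab for v, lab in vlabel1.items()}
elabel2 = {frozenset(slide_f[v] for v in e): lab for e, lab in elabel1.items()}

mapping = find_isomorphism((vlabel1, elabel1), (vlabel2, elabel2))
print(mapping)
```

Trying all permutations costs n! steps, which is why the rest of the talk looks for a canonical string form instead.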


Problem: Finding Frequent Subgraphs

Problem setting: similar to finding frequent itemsets for association rule discovery

Input:
Database of graph transactions
Undirected simple graphs (no multiple edges)
Each graph transaction has labeled edges/vertices
Transactions may not be connected
Minimum support threshold

Output: Frequent subgraphs that satisfy the support threshold, where each frequent subgraph is connected.
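To make the input and output concrete, here is a minimal sketch for the smallest possible patterns, single labeled edges. It assumes that the support of a pattern is the number of graph transactions containing it at least once. The data layout and the name `frequent_edges` are hypothetical and chosen only for this illustration.

```python
from collections import Counter

def frequent_edges(database, min_support):
    """Return the labeled edges contained in at least min_support transactions.

    Each transaction is a pair (vlabel, edges): vlabel maps vertex -> label,
    edges is a list of (u, v, edge_label) for an undirected simple graph.
    """
    support = Counter()
    for vlabel, edges in database:
        seen = set()
        for u, v, elabel in edges:
            # Sort the endpoint labels so that X-a-Y and Y-a-X are one pattern.
            a, b = sorted((vlabel[u], vlabel[v]))
            seen.add((a, elabel, b))
        # A transaction contributes at most 1 to the support of a pattern.
        support.update(seen)
    return {p: n for p, n in support.items() if n >= min_support}

# Hypothetical database of three graph transactions.
database = [
    ({0: "X", 1: "Y", 2: "Z"}, [(0, 1, "a"), (1, 2, "b")]),
    ({0: "Y", 1: "X", 2: "X"}, [(0, 1, "a"), (1, 2, "a")]),
    ({0: "X", 1: "Y", 2: "X"}, [(0, 1, "a"), (1, 2, "a")]),
]
result = frequent_edges(database, min_support=2)
print(result)
```

Larger patterns need a subgraph containment test per transaction, which is the expensive part the rest of the talk addresses.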


Finding Frequent Subgraphs


Authors' Contribution

Representing graphs as strings (like TreeMiner)
No candidate generation!
“It combines the growing and checking of frequent subgraphs into one procedure, thus accelerates the mining process.”
Really fast; still a standard baseline that most rivals compare their systems to.


Concepts behind gSpan

The idea is to produce a Depth-First Search (DFS) code for each graph, with one entry per edge

Edges are sorted according to the lexicographic order of the codes

Yan and Han proved that graph isomorphism can be tested by comparing the minimum DFS codes of two graphs

Starting with small graph patterns containing one edge, patterns are expanded systematically by depth-first search

Employ the anti-monotonic property of graph frequency


Anti-Monotonicity of graph frequency

The frequency of a super-pattern is less than or equal to the frequency of a sub-pattern. (Copyright SIGMOD’08)
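Written as a formula, the property reads roughly as follows. The notation is assumed here rather than taken from the slides: $D$ is the graph database, $g \subseteq g'$ means $g$ is a subgraph of $g'$, and $\mathrm{sup}_D$ counts the graph transactions that contain a pattern. The second implication is why an infrequent pattern never needs to be extended.

```latex
g \subseteq g' \;\Longrightarrow\; \mathrm{sup}_D(g') \le \mathrm{sup}_D(g),
\qquad\text{so}\qquad
\mathrm{sup}_D(g) < \mathit{minsup} \;\Longrightarrow\; \mathrm{sup}_D(g') < \mathit{minsup}.
```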


Lexicographic Ordering in Graph

It can tell us the order of two graphs.
The design can help us build a similar hierarchy.
The design should guarantee easy growing from one level to the next lower level and easy rolling up from a lower level to a higher level.
It may be difficult to find a design in which no two nodes in this tree are the same in the graph case.
It can tell us whether a graph has already been discovered.
Moreover, and most importantly, if a graph has been discovered, all its children nodes in the hierarchy must have been discovered.


Lexicographic Ordering in Graph

[Figure: search hierarchy with one level each for 1-edge, 2-edge, 3-edge, ... graphs]


DFS code and Minimum DFS code

Depth First Tree and Forward/Backward Edge Set


DFS code and Minimum DFS code

We use a 5-tuple (vi, vj, l(vi), l(vj), l(vi,vj)) to represent an edge. (It may be redundant, but it is much easier to understand.)

Turn a graph into a sequence whose basic element is a 5-tuple.

Form the sequence in this order:
To extend to one new node, add the forward edge that connects one node in the old graph with this new node.
Add all backward edges that connect this new node to other nodes in the old graph.
Repeat this procedure.


DFS code

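[Figure: example graph with vertices v0 (X), v1 (Y), v2 (X), v3 (Z), v4 (Z) and edge labels a, a, b, b, c, d, with its DFS code built up one edge at a time]

e0: (0,1,x,y,a)
e1: (1,2,y,x,b)
e2: (2,0,x,x,a)
e3: (2,3,x,z,c)
e4: (3,1,z,y,b)
e5: (1,4,y,z,d)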
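The construction can be sketched in a few lines: each time the traversal reaches a new node it emits the forward edge, then all backward edges from that node to nodes already in the code. This is an illustrative sketch and not the authors' implementation. The function name `dfs_code` and the graph encoding are assumptions, and the visiting order is simply the order in which the edges are listed.

```python
def dfs_code(vlabel, edges, start):
    """Build one DFS code as a list of 5-tuples (i, j, l(vi), l(vj), l(vi,vj)).

    vlabel maps vertex -> label; edges is a list of (u, v, edge_label).
    Neighbors are visited in the order in which the edges are listed.
    """
    adj = {v: {} for v in vlabel}
    for u, v, lab in edges:
        adj[u][v] = lab
        adj[v][u] = lab

    index = {start: 0}  # original vertex -> DFS discovery index
    code = []

    def visit(v):
        for u in adj[v]:
            if u in index:
                continue
            index[u] = len(index)
            # Forward edge from an old node to the new node.
            code.append((index[v], index[u], vlabel[v], vlabel[u], adj[v][u]))
            # Backward edges from the new node to older nodes (not its parent).
            back = sorted((index[w], w) for w in adj[u]
                          if w in index and w not in (u, v))
            for j, w in back:
                code.append((index[u], j, vlabel[u], vlabel[w], adj[u][w]))
            visit(u)

    visit(start)
    return code

# The example graph, with vertex ids following v0..v4 in the figure.
vlabel = {0: "x", 1: "y", 2: "x", 3: "z", 4: "z"}
edges = [(0, 1, "a"), (1, 2, "b"), (2, 0, "a"),
         (2, 3, "c"), (3, 1, "b"), (1, 4, "d")]

for n, e in enumerate(dfs_code(vlabel, edges, start=0)):
    print(f"e{n}: {e}")
```

Starting from another vertex, or listing the edges in a different order, yields a different DFS code for the same graph.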


Minimum DFS code

Each graph may have many DFS codes (why?): the lexicographically smallest one is its minimum DFS code.

Edge no.   (B)            (C)            (D)
0          (0,1,x,y,a)    (0,1,y,x,a)    (0,1,x,x,a)
1          (1,2,y,x,b)    (1,2,x,x,a)    (1,2,x,y,b)
2          (2,0,x,x,a)    (2,0,x,y,b)    (2,0,y,x,a)
3          (2,3,x,z,c)    (2,3,x,z,c)    (2,3,y,z,b)
4          (3,1,z,y,b)    (3,0,z,y,b)    (3,1,z,x,c)
5          (1,4,y,z,d)    (0,4,y,z,d)    (2,4,y,z,d)
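Ranking DFS codes needs a concrete comparison rule. The sketch below encodes the DFS lexicographic order as a sort key, following the rules as I understand them from the gSpan paper. Treat them as an assumption: a backward edge precedes a forward edge; backward edges compare by target index, then edge label; forward edges prefer the deeper source vertex, then compare labels in the order l(vi), l(vi,vj), l(vj). The three codes are transcribed so that each is a valid DFS code of the example graph.

```python
def order_key(code):
    """Sort key approximating gSpan's DFS lexicographic order (assumed rules)."""
    key = []
    for i, j, li, lj, le in code:
        if j < i:
            key.append((0, j, le))           # backward edge
        else:
            key.append((1, -i, li, le, lj))  # forward edge, deeper source first
    return key

B = [(0, 1, "x", "y", "a"), (1, 2, "y", "x", "b"), (2, 0, "x", "x", "a"),
     (2, 3, "x", "z", "c"), (3, 1, "z", "y", "b"), (1, 4, "y", "z", "d")]
C = [(0, 1, "y", "x", "a"), (1, 2, "x", "x", "a"), (2, 0, "x", "y", "b"),
     (2, 3, "x", "z", "c"), (3, 0, "z", "y", "b"), (0, 4, "y", "z", "d")]
D = [(0, 1, "x", "x", "a"), (1, 2, "x", "y", "b"), (2, 0, "y", "x", "a"),
     (2, 3, "y", "z", "b"), (3, 1, "z", "x", "c"), (2, 4, "y", "z", "d")]

ranked = sorted({"B": B, "C": C, "D": D}.items(), key=lambda kv: order_key(kv[1]))
names = [name for name, _ in ranked]
print(names)
```

Here the ranking is already decided by edge 0. The smallest of these three codes is not necessarily the minimum DFS code of the graph, since the graph has more DFS codes than the three listed.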


Graph Parent and its Children

[Figure: example graph (vertex labels X, Y, Z; edge labels a, b, c) with candidate extensions marked “?”]

Given a DFS code c0 = (e0, e1, …, en), if c1 = (e0, e1, …, en, ex) and c0 < c1, then c0 is c1’s parent and c1 is c0’s child.


DFS Code Tree

[Figure: DFS code tree with one level each for 1-edge, 2-edge, 3-edge, ... DFS codes]


Theorem

1. Given two graphs G0 and G1, G0 is isomorphic to G1 iff min_dfs_code(G0) = min_dfs_code(G1).

2. The DFS Code Tree covers all graphs, although some tree nodes may represent the same graph.

3. Given a node in the DFS Code Tree, if its DFS code is not its minimum DFS code, pruning this node and all its descendants won’t change the “covering”.
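Statement 1 can be tried out on small connected graphs with a brute-force sketch: enumerate every DFS code of a graph, take the smallest, and compare the results for two graphs. This is not how gSpan computes minimum codes; it is an exhaustive illustration. The names `all_dfs_codes`, `order_key`, `min_dfs_code`, and `isomorphic` are hypothetical, and the ordering rules in `order_key` are my reading of the gSpan paper.

```python
def all_dfs_codes(vlabel, edges):
    """Yield every DFS code of a connected graph as a tuple of 5-tuples."""
    adj = {v: {} for v in vlabel}
    for u, v, lab in edges:
        adj[u][v] = lab
        adj[v][u] = lab

    def extend(code, index, stack):
        if len(index) == len(vlabel):
            yield tuple(code)
            return
        stack = list(stack)
        # Backtrack to the deepest node that still has an unvisited neighbor.
        while all(u in index for u in adj[stack[-1]]):
            stack.pop()
        v = stack[-1]
        for u in adj[v]:
            if u in index:
                continue
            idx = {**index, u: len(index)}
            step = [(idx[v], idx[u], vlabel[v], vlabel[u], adj[v][u])]
            for j, w in sorted((idx[w], w) for w in adj[u]
                               if w in index and w != v):
                step.append((idx[u], j, vlabel[u], vlabel[w], adj[u][w]))
            yield from extend(code + step, idx, stack + [u])

    for start in vlabel:
        yield from extend([], {start: 0}, [start])

def order_key(code):
    """Sort key approximating gSpan's DFS lexicographic order (assumed rules)."""
    return [(0, j, le) if j < i else (1, -i, li, le, lj)
            for i, j, li, lj, le in code]

def min_dfs_code(vlabel, edges):
    return min(all_dfs_codes(vlabel, edges), key=order_key)

def isomorphic(g0, g1):
    """Theorem, statement 1: equal minimum DFS codes iff isomorphic."""
    return min_dfs_code(*g0) == min_dfs_code(*g1)

# Example graph from the DFS code slides.
vlabel = {0: "X", 1: "Y", 2: "X", 3: "Z", 4: "Z"}
edges = [(0, 1, "a"), (1, 2, "b"), (2, 0, "a"),
         (2, 3, "c"), (3, 1, "b"), (1, 4, "d")]

# The same graph with renumbered vertices, and a variant with one label changed.
perm = {0: 3, 1: 0, 2: 4, 3: 1, 4: 2}
vlabel2 = {perm[v]: lab for v, lab in vlabel.items()}
edges2 = [(perm[u], perm[v], lab) for u, v, lab in edges]
edges3 = edges[:-1] + [(1, 4, "a")]

print(min_dfs_code(vlabel, edges))
print(isomorphic((vlabel, edges), (vlabel2, edges2)))
print(isomorphic((vlabel, edges), (vlabel, edges3)))
```

The enumeration is exponential in the worst case, so this is only suitable for checking small examples by hand.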


Algorithm


Algorithm


Experimental Result


Experimental Result


Conclusion

No candidate generation and false test
Space saving from depth-first search
Good performance: using a “memory pool” and one major counting improvement, it seems the performance will be improved 5 times more (but this needs more testing).


Exam Questions

Q1) What two major costs from Apriori-like, frequent substructure mining algorithms did gSpan aim to reduce/avoid?

Answer:

1) The creation of size k+1 candidate subgraphs from size k frequent subgraphs is more complicated and costly than the standard Apriori large itemset generation.

2) Pruning false positives is an expensive process. The subgraph isomorphism problem is NP-complete.


Exam Questions (cont.)

Q2) Which DFS tree does the DFS code below belong to?

Answer: tree (c)


Exam Questions

Q3) What does gSpan compare when testing for isomorphism between two graphs, and why?

Answer: gSpan compares the minimum DFS codes of the two graphs. Given two graphs G and G’, G is isomorphic to G’ if and only if min(G) = min(G’). This theorem reduces the isomorphism test between two graphs to a simple string comparison. If two nodes of the DFS code tree represent the same graph but have different DFS codes, only one of them can carry the minimum DFS code, so we can prune the sub-branch of the rightmost of the two nodes. This greatly decreases the problem size.


Questions?