1
Mining Tree Queries in a Graph
Bart Goethals, Eveline Hoekx and Jan Van den Bussche
KDD'05
Presenter: Ming Jing Tsai
2
Introduction
Mine tree patterns T in a single graph
Trees are generated incrementally in the number of nodes
Trees are unordered and rooted
For each tree T, all conjunctive queries based on T are generated
Queries are expressed in SQL
3
Tree query pattern example
Selected nodes (constants): 0, 8
Existential node: ∃
Distinguished node: x
4
Matching
A query Q matches in a graph G via a homomorphism h
For every edge (i, j) ∈ Q, (h(i), h(j)) ∈ G
Matchings are distinguished by their values on the distinguished nodes x
Matchings that differ only in the values of existential nodes are not counted separately
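The matching rule above can be sketched as a brute-force homomorphism search. This is only an illustration, not the authors' implementation: the function name `answers`, the node names `r`, `e`, `x`, and the small graph are all hypothetical, and frequency is taken to be the number of distinct values on the distinguished nodes.

```python
from itertools import product

def answers(query_edges, distinguished, selected, graph_edges):
    """Return the distinct value tuples on the distinguished nodes.

    query_edges:   list of (parent, child) pairs of the tree query
    distinguished: list of distinguished query nodes (fixes tuple order)
    selected:      dict mapping each selected query node to its constant
    graph_edges:   list of (source, target) pairs of the graph G
    Every other query node is treated as existential.
    """
    graph_nodes = sorted({v for edge in graph_edges for v in edge})
    query_nodes = sorted({v for edge in query_edges for v in edge})
    edge_set = set(graph_edges)
    found = set()
    # Try every assignment h of graph nodes to query nodes (tiny inputs only).
    for values in product(graph_nodes, repeat=len(query_nodes)):
        h = dict(zip(query_nodes, values))
        # Selected nodes must map to their constants.
        if any(h[node] != const for node, const in selected.items()):
            continue
        # h is a homomorphism if every query edge lands on a graph edge.
        if all((h[i], h[j]) in edge_set for i, j in query_edges):
            found.add(tuple(h[x] for x in distinguished))
    return found

# Hypothetical data: root selected as constant 0, an existential child,
# and a distinguished grandchild x.
example_query = [("r", "e"), ("e", "x")]
example_graph = [(0, 1), (0, 2), (1, 4), (1, 5), (2, 5), (2, 8), (3, 9)]
example_answers = answers(example_query, ["x"], {"r": 0}, example_graph)
example_frequency = len(example_answers)
```

Node 5 is reachable through two different existential nodes (1 and 2) but is counted once, which is the point of ignoring existential values.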
5
[Figure: query Q (nodes ∃, 0, 8) matched in graph G]
Frequency = 3 (4, 5, 8)
6
Generate all trees
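Trees are generated in increasing number of nodes
Each tree is canonically ordered
Level sequence: the i-th number is the depth of the i-th node in preorder
Canonical ordering: the one whose level sequence is lexicographically maximal
Example: level sequence 012212 > 012122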
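A minimal sketch of how the lexicographically maximal level sequence of a rooted, unordered tree can be computed: sort the child subtrees of every node by their own level sequences in descending order, then read off the depths in preorder. The function name and the `children` dictionary layout are assumptions made for this example.

```python
def canonical_level_sequence(children, root, depth=0):
    """Return the lexicographically maximal level sequence of the subtree at root.

    children: dict mapping a node to the list of its children (any order).
    """
    # Canonical sequences of the child subtrees, one level deeper.
    subsequences = [
        canonical_level_sequence(children, child, depth + 1)
        for child in children.get(root, [])
    ]
    # Placing larger subtree sequences first maximizes the whole sequence.
    subsequences.sort(reverse=True)
    sequence = [depth]
    for sub in subsequences:
        sequence.extend(sub)
    return sequence

# The tree from the example: one child of the root has a single child,
# the other has two. Stored here in the non-canonical order 012122.
example_children = {0: [1, 2], 1: [3], 2: [4, 5]}
example_sequence = canonical_level_sequence(example_children, 0)
```

Two orderings of the same unordered tree give the same result, so comparing canonical sequences is enough to detect duplicates.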
7
Queries
Levelwise search
Fix a tree T, and find all queries based on T whose frequency in G is at least k
A query is written Q = (∏, ∑, λ)
∏: existential nodes
∑: selected nodes
λ: labeling of the selected nodes
8
9
To generate candidates efficiently, candidacy tables and frequency tables are used
10
Candidacy table CanTab(∏, ∑)
Each candidacy table can be computed by taking the natural join of the frequency tables of its parents (∏', ∑')
Base case: CanTab(∅, {x}) is the table with a single column x, holding all nodes of the graph G being mined
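The natural-join step can be illustrated with tables stored as lists of rows. This is a sketch under assumptions: the row-as-dictionary layout, the function names, and the sample frequency tables are hypothetical, and the definition of which (∏', ∑') count as parents is not reproduced here.

```python
from functools import reduce

def natural_join(left, right):
    """Natural join of two tables given as lists of dicts (column -> value)."""
    joined = []
    for left_row in left:
        for right_row in right:
            shared = left_row.keys() & right_row.keys()
            # Rows combine only if they agree on every shared column.
            if all(left_row[col] == right_row[col] for col in shared):
                joined.append({**left_row, **right_row})
    return joined

def candidacy_table(parent_frequency_tables):
    """Join the frequency tables of all parents into one candidacy table."""
    return reduce(natural_join, parent_frequency_tables)

# Hypothetical parent frequency tables over label columns x1, x2, x3.
freq_a = [{"x1": 0, "x2": 5}, {"x1": 8, "x2": 6}]
freq_b = [{"x2": 5, "x3": 7}, {"x2": 9, "x3": 4}]
example_candidates = candidacy_table([freq_a, freq_b])
```

Only label combinations that were frequent in every parent survive the join, which is the usual levelwise pruning idea.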
11
Example: ∏ = {x2}, ∑ = {x1, x3}
The candidacy table and the frequency table are each formulated as an expression and translated to SQL
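A hedged sketch of what a frequency-table query in SQL might look like, run here through Python's built-in `sqlite3` module instead of DB2. The tree shape is an assumption (x1 is the root with children x2 and x4, and x3 is a child of x2, with x4 as the distinguished node), as are the table name `G`, its columns `src` and `dst`, and the function name.

```python
import sqlite3

def frequency_table(edges, k):
    """Return (x1, x3, freq) rows with freq >= k for the assumed tree shape."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE G (src INTEGER, dst INTEGER)")
    conn.executemany("INSERT INTO G VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        SELECT e1.src AS x1, e2.dst AS x3, COUNT(DISTINCT e3.dst) AS freq
        FROM G AS e1                      -- edge x1 -> x2 (x2 is existential)
        JOIN G AS e2 ON e2.src = e1.dst   -- edge x2 -> x3
        JOIN G AS e3 ON e3.src = e1.src   -- edge x1 -> x4 (x4 is distinguished)
        GROUP BY e1.src, e2.dst           -- one group per labeling of x1, x3
        HAVING COUNT(DISTINCT e3.dst) >= ?
        ORDER BY x1, x3
        """,
        (k,),
    ).fetchall()
    conn.close()
    return rows

# Hypothetical graph: node 0 has children 1, 3, 4 and node 1 has child 2.
example_edges = [(0, 1), (1, 2), (0, 3), (0, 4)]
example_rows = frequency_table(example_edges, 2)
```

The existential node x2 is joined on but never selected or counted, while `COUNT(DISTINCT ...)` counts distinct values of the distinguished node for each labeling of the selected nodes.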
12
Equivalent queries
Goal: avoid generating a query Q2 that is equivalent to an earlier query Q1
A containment mapping from Q1 to Q2 is a homomorphism that maps the distinguished variables of Q1 one-to-one to those of Q2, and likewise the selected nodes
Case 1: Q1 has fewer nodes than Q2
Case 2: Q1 and Q2 have the same number of nodes
13
Case 1: redundancy checking
Q2 contains redundant subtrees such that removing them yields an equivalent query
Redundancy: a subtree C in the form of a linear chain of existential nodes, such that the parent of C has another subtree that is at least as deep as C
[Figure: Q1 and Q2]
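The redundancy condition can be sketched as a direct check on the tree. This is an illustration only: the function names, the `children` dictionary, and the node kinds `"E"` (existential), `"C"` (selected) and `"X"` (distinguished) are assumptions, and depth is measured as the number of nodes on the longest downward path.

```python
def height(children, node):
    """Number of nodes on the longest downward path starting at node."""
    return 1 + max((height(children, c) for c in children.get(node, [])), default=0)

def existential_chain_length(children, kind, node):
    """Length of the subtree at node if it is a linear chain of existential
    nodes, otherwise None."""
    length = 0
    while True:
        if kind[node] != "E":
            return None
        length += 1
        kids = children.get(node, [])
        if not kids:
            return length
        if len(kids) > 1:
            return None  # branching, so not a linear chain
        node = kids[0]

def has_redundant_subtree(children, kind):
    """True if some existential chain C has a sibling subtree at least as deep."""
    for parent, kids in children.items():
        for child in kids:
            chain = existential_chain_length(children, kind, child)
            if chain is None:
                continue
            if any(height(children, sib) >= chain for sib in kids if sib != child):
                return True
    return False

# Hypothetical queries: in the first, the existential leaf 1 has a sibling
# of equal depth; in the second, the existential chain has no sibling.
redundant = has_redundant_subtree({0: [1, 2]}, {0: "X", 1: "E", 2: "C"})
not_redundant = has_redundant_subtree({0: [1], 1: [2]}, {0: "X", 1: "E", 2: "E"})
```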
14
Case 2: canonical forms
Q1 and Q2 are isomorphic as trees
Equivalence is detected by comparing canonical forms
Existential nodes → ∃
Selected nodes → C
Distinguished nodes → X
C, ∃
∃,C
∃,X
C,X
X,C
X,X
15
Experiment
Pentium 4, 2.8 GHz
1 GB main memory
Linux 2.6
C++ with embedded SQL
Relational database: DB2 UDB v8.2
16
Real dataset
A food web, a protein interaction graph, and a citation graph
k: frequency threshold
Size: maximal size of trees in the run
Running time: several hours
17
Food web
154 species dependent on Scotch Broom
Label 20 occurs in many frequent patterns: Orthotylus adenocarpi (an omnivorous plant pest)
Frequency 176
18
Protein interaction graph
1870 kinds, from Saccharomyces cerevisiae, a fermenting yeast (used to leaven bread)
A small number of highly connected nodes occur
19
Citation graph
KDD Cup 2003
2500 papers on high-energy physics
350,000 cross-references
Frequency 1655
20
Synthetic data, web graphs
Tree size: 5
Minsup: 4, 10, 25
21
Uniform random graphs
Dense, uniform
Minsup: 10, 25
Edges: 47,264,997