Graph Indexing Techniques Seoul National University IDB Lab. Kisung Kim 2011. 3. 23
Dec 26, 2015
Outline
• Category of graph queries
• Querying in collection DB
• References
Category of Graph Queries: Matching Type
• Exact subgraph matching
  – Find graphs in DB which have all components of the query graph
• Similarity subgraph matching
  – Find graphs in DB which have some components of the query graph
  – Similarity measure is needed
• Super graph matching
  – Find graphs in DB which are contained in the query graph
[Figure: query graphs with an exact subgraph match and a similarity subgraph match]
Category of Graph Queries: Target DB
• Collection DB: large number of small graphs
  – e.g. chemical compounds
  – Retrieval component: IDs of graphs which contain matching parts
• Large graphs: small number of large graphs
  – e.g. social network, RDF graph
  – Retrieval component: all matching subgraphs
[Figure: Querying Collection DB. A query graph is evaluated against graphs G1 to G7; results: graph ID list (G1, G3, G5)]
[Figure: Querying Large Graphs. A query graph is evaluated against a large graph; results: matching subgraphs]
Query Processing in Collection DB
• Processing flow
  – Query → Filtering (over the graphs in DB) → Candidate graph set → Verification → Answer
• Verification uses the usual pairwise subgraph isomorphism algorithm
• Most techniques focus on filtering
  – The cost of verification is high
  – Filtering reduces the number of verifications that must be executed
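The filtering-and-verification flow above can be sketched in Python as follows. This is only a minimal illustration, not any paper's implementation: the `(labels, edges)` graph representation, the label-count filter `label_filter`, and the brute-force `is_subgraph` test are all assumptions chosen for brevity.

```python
from collections import Counter
from itertools import permutations

def make_graph(labels, edges):
    """Hypothetical representation: node -> label dict plus a set of undirected edges."""
    return labels, {frozenset(e) for e in edges}

def label_filter(query, graph):
    """Cheap filter: the graph must contain every query label often enough."""
    need, have = Counter(query[0].values()), Counter(graph[0].values())
    return all(have[lab] >= n for lab, n in need.items())

def is_subgraph(query, graph):
    """Brute-force (non-induced) subgraph isomorphism test; exponential cost."""
    q_labels, q_edges = query
    g_labels, g_edges = graph
    q_nodes = list(q_labels)
    # Try every injective assignment of query nodes to graph nodes.
    for image in permutations(g_labels, len(q_nodes)):
        mapping = dict(zip(q_nodes, image))
        if any(q_labels[n] != g_labels[mapping[n]] for n in q_nodes):
            continue
        if all(frozenset(mapping[n] for n in e) in g_edges for e in q_edges):
            return True
    return False

def answer(query, db):
    """Filter first, then verify only the surviving candidates."""
    candidates = [gid for gid, g in db.items() if label_filter(query, g)]
    return candidates, [gid for gid in candidates if is_subgraph(query, db[gid])]

db = {
    "g1": make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (1, 3)]),
    "g2": make_graph({1: "A", 2: "B"}, [(1, 2)]),
    "g3": make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),
}
query = make_graph({"x": "A", "y": "C"}, [("x", "y")])
candidates, answers = answer(query, db)
# g2 is filtered out; g3 is a false positive that verification removes.
print(candidates, answers)
```

The filter is deliberately weak here; the indexing techniques in the rest of the deck aim to make the candidate set as close to the answer set as possible.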
Query Processing in Large Graphs
• Processing flow
  – Query → Index search → Candidate node sets → Building subgraphs → Answer subgraphs
• Focus on node indexing
  – To reduce search space
  – Use structural information of nodes
• Build subgraphs by joining candidate nodes
  – Join methods are relatively under-researched
  – Optimization using join ordering
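As a rough sketch of this flow, the Python below builds candidate node sets and then joins them into answer subgraphs. The function `match_large_graph`, the label-only "index search", and the fewest-candidates-first join order are illustrative assumptions, not the method of any of the cited systems.

```python
def match_large_graph(q_labels, q_edges, g_labels, g_adj):
    """Return all embeddings of the query as dicts: query node -> graph node."""
    # Adjacency of the query graph, used as the join condition.
    q_adj = {q: set() for q in q_labels}
    for a, b in q_edges:
        q_adj[a].add(b)
        q_adj[b].add(a)
    # "Index search": candidate node set per query node (label match only).
    candidates = {
        q: [v for v, lab in g_labels.items() if lab == q_labels[q]]
        for q in q_labels
    }
    # Simple join ordering: most selective query node first.
    order = sorted(q_labels, key=lambda q: len(candidates[q]))
    results = []

    def extend(partial, i):
        if i == len(order):
            results.append(dict(partial))
            return
        q = order[i]
        for v in candidates[q]:
            if v in partial.values():
                continue  # keep the mapping injective
            # Join: every query edge to an already matched node must exist in the graph.
            if all(partial[o] in g_adj[v] for o in q_adj[q] if o in partial):
                partial[q] = v
                extend(partial, i + 1)
                del partial[q]

    extend({}, 0)
    return results

g_labels = {1: "A", 2: "B", 3: "B", 4: "C"}
g_adj = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2}}
q_labels = {"a": "A", "b": "B", "c": "C"}
q_edges = [("a", "b"), ("b", "c")]
matches = match_large_graph(q_labels, q_edges, g_labels, g_adj)
print(matches)
```

A stronger node index (for example one that also uses neighborhood information) would shrink the candidate lists before the join starts.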
Graph Indexing Techniques
Technique                           Target Database  Query Type           Approach
GraphGrep [Shasha et al., PODS’02]  Collection DB    Exact                Feature (path) based index
gIndex [Yan et al., SIGMOD’04]      Collection DB    Exact                Feature (graph) based index
Grafil [Yan et al., SIGMOD’05]      Collection DB    Exact & Similarity   Feature based similarity search
C-tree [He and Singh, ICDE’06]      Collection DB    Exact & Similarity   Closure based index
QuickSI [Shang et al., VLDB’08]     Collection DB    Exact                Verification algorithm
Tale [Tian and Patel, ICDE’08]      Collection DB    Exact & Similarity   Similarity search using node index
GraphQL [He and Singh, SIGMOD’08]   Large graphs     Exact                Node indexing
Spath [Zhao and Han, VLDB’10]       Large graphs     Exact                Node indexing using neighborhood information
GraphGrep(1/2) [Shasha et al. PODS’02]
• First work to adopt the filtering-and-verification framework
• Path-based index
  – Fingerprint of database
  – Enumerate the set of all paths (length <= L) of all graphs in DB
  – For each path, the number of occurrences in each graph is stored in a hash table
[Figure: example graphs g1, g2, g3 with node labels A to E]
Index:
Key      g1  g2  g3
h(CA)    1   0   1
…
h(ABCB)  2   2   0
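A path-based fingerprint of this kind might be built as in the Python sketch below. Several details are assumptions made for illustration: graphs are `(labels, adjacency)` pairs, only simple paths with at least one edge are counted, each path is counted once per traversal direction, and a plain `dict` keyed by the label string stands in for the hash function h.

```python
from collections import Counter

def label_paths(labels, adj, max_len):
    """Count label sequences of all simple paths with 1..max_len edges."""
    counts = Counter()

    def walk(path):
        if len(path) > 1:
            counts["".join(labels[n] for n in path)] += 1
        if len(path) - 1 == max_len:
            return  # path already has max_len edges
        for nxt in adj[path[-1]]:
            if nxt not in path:  # simple paths only
                walk(path + [nxt])

    for start in labels:
        walk([start])
    return counts

def build_index(db, max_len):
    """Map each label path to {graph id: number of occurrences}."""
    index = {}
    for gid, (labels, adj) in db.items():
        for key, n in label_paths(labels, adj, max_len).items():
            index.setdefault(key, {})[gid] = n
    return index

db = {
    "g1": ({1: "A", 2: "B", 3: "C"}, {1: [2], 2: [1, 3], 3: [2]}),  # A - B - C
    "g2": ({1: "A", 2: "B"}, {1: [2], 2: [1]}),                      # A - B
}
index = build_index(db, max_len=2)
print(index["AB"], index["ABC"])
```

Graphs that do not contain a path simply have no entry under that key, which a lookup can treat as a count of zero.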
GraphGrep(2/2): Query Processing
• Filtering
  – Make the fingerprint of query q: hash all paths (length <= L) of q
  – Compare the fingerprint of the query with the fingerprint of the database
  – Discard a graph whose value in the fingerprint is less than the value in the query fingerprint
• Verification
  – Run subgraph isomorphism tests on the remaining candidates
[Figure: example graphs g1, g2, g3 and a query graph with nodes A, B, C]
Index:
Key     g1  g2  g3
h(AB)   2   2   1
h(AC)   1   0   1
h(BAC)  2   0   1
Query fingerprint: AB:1, AC:1, BAC:1
Candidates = {g1, g3} → Verification
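The filtering step can be reproduced on the example index with a few lines of Python. The counts are the ones shown on the slide; the function name `filter_candidates` and the use of path strings as keys (instead of hash values) are assumptions for readability.

```python
# Path counts per graph, copied from the example index.
index = {
    "AB":  {"g1": 2, "g2": 2, "g3": 1},
    "AC":  {"g1": 1, "g2": 0, "g3": 1},
    "BAC": {"g1": 2, "g2": 0, "g3": 1},
}
# Fingerprint of the query graph.
query_fp = {"AB": 1, "AC": 1, "BAC": 1}

def filter_candidates(index, query_fp, graph_ids):
    """Keep graphs whose count is at least the query count for every query path."""
    candidates = []
    for gid in graph_ids:
        if all(index.get(path, {}).get(gid, 0) >= need
               for path, need in query_fp.items()):
            candidates.append(gid)
    return candidates

candidates = filter_candidates(index, query_fp, ["g1", "g2", "g3"])
print(candidates)  # g2 is discarded because it has no AC or BAC path
```

Only the surviving candidates are passed on to the subgraph isomorphism test.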
gIndex(1/6) [Yan et al., SIGMOD’04]
• Path-based approach has weak points
  – Path is too simple: structural information is lost
  – There are too many paths: the set of paths in a graph database is usually huge
• Solution
  – Use graph structure instead of path as the basic index feature
[Figure: a sample database and a query graph whose nodes are all labeled c; the paths in the query graph cannot filter any graphs in the database]
gIndex(2/6): Frequent Fragment
• The number of graph structures is large → index only frequent subgraphs
• support(g)
  – The number of graphs in D (the graph database) that contain g as a subgraph
• minSup
  – Minimum support threshold
  – Index a fragment g only if support(g) ≥ minSup
• Size-increasing support
  – The number of frequent fragments grows as the fragment size increases
  – Low minSup for small fragments, high minSup for large fragments
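A small Python sketch of size-increasing support follows. It assumes the set of graphs containing each fragment is already known (computing it requires subgraph isomorphism tests), and the step function `min_sup`, the fragment names, and their ID sets are made-up illustration values.

```python
def min_sup(size):
    """Hypothetical size-increasing threshold: low for small fragments, higher for large ones."""
    return 1 if size <= 2 else 2

def support(ids):
    """support(g): number of graphs in D that contain fragment g."""
    return len(ids)

def frequent_fragments(fragments, min_sup):
    """fragments maps a fragment name to (size in edges, ids of graphs containing it)."""
    return sorted(
        name for name, (size, ids) in fragments.items()
        if support(ids) >= min_sup(size)
    )

fragments = {
    "A-B":     (1, {"g1", "g2", "g3"}),
    "A-B-B":   (2, {"g1"}),
    "A-B-A-B": (3, {"g1", "g2"}),
    "B-A-B-B": (3, {"g3"}),
}
selected = frequent_fragments(fragments, min_sup)
print(selected)  # the size-3 fragment with support 1 is not indexed
```

With a single uniform threshold, either small useful fragments would be dropped or a very large number of big fragments would be indexed; letting the threshold grow with size avoids both.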
gIndex(3/6): Frequent Fragment
[Figure: fragments of size 1 to 4 over node labels A and B, each annotated with its frequency F (between 1 and 4); minSup = 1 for sizes 1 and 2, minSup = 2 for sizes 3 and 4]
gIndex(4/6): Discriminative Fragment
• Redundant fragment
  – A fragment whose indexed graphs are also indexed by its subgraphs
  – Redundant fragments need not be included in the index
• Discriminative fragment
  – A fragment which is not redundant
[Figure: graphs g1 to g4 and fragments f1, f2, f3 (sizes 2 and 3) over node labels A and B]
Df1 = {g1, g2, g3}
Df2 = {g2, g3, g4}
Df3 = {g2, g3} = Df1 ∩ Df2
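The redundancy test in the example can be written as a short Python check. The ID sets are the ones from the slide; the function `is_redundant` and the `subfragments` mapping (which fragments are subgraphs of which) are assumptions for illustration, using the strict equality criterion shown in the example.

```python
# Graphs containing each fragment, as in the example.
containing = {
    "f1": {"g1", "g2", "g3"},
    "f2": {"g2", "g3", "g4"},
    "f3": {"g2", "g3"},
}
# Assumed containment relation: f1 and f2 are subgraphs of f3.
subfragments = {"f3": ["f1", "f2"]}

def is_redundant(frag, containing, subfragments):
    """A fragment is redundant if intersecting its subfragments' ID sets already yields its own ID set."""
    subs = subfragments.get(frag, [])
    if not subs:
        return False  # nothing smaller to derive it from
    covered = set.intersection(*(containing[s] for s in subs))
    return covered == containing[frag]

for f in ("f1", "f2", "f3"):
    print(f, "redundant" if is_redundant(f, containing, subfragments) else "discriminative")
```

Here f3 adds no filtering power beyond f1 and f2, so only f1 and f2 would need ID lists in the index.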
gIndex(5/6): gIndex Tree
• Use graph serialization method
  – For fast graph isomorphism checking during index search
  – DFS coding [Yan et al. ICDM’02]
  – Translate a graph into a unique edge sequence
• gIndex Tree
  – Prefix tree which consists of the edge sequences of discriminative fragments
  – Record all size-n discriminative fragments in level n
  – Black nodes: discriminative fragments, which have ID lists (the ids of graphs containing fi)
  – White nodes: redundant fragments, kept for Apriori pruning
[Figure: a graph with node labels X, X, Z, Y and edge labels a, b, with nodes numbered v0 to v3]
DFS coding: <(v0,v1),(v1,v2),(v2,v0),(v1,v3)>
[Figure: gIndex tree with levels 0, 1, 2, …; edges e1, e2, e3 and fragments f1, f2, f3]
gIndex(6/6): Searching
• Searching process
  – Given a query q, enumerate all of q’s fragments (size <= maxSize)
  – Locate the fragments in the gIndex tree
  – Intersect the ID lists associated with the fragments
• Apriori pruning
  – Generating every fragment is inefficient
  – If a fragment is not in the gIndex tree, its supergraphs need not be checked
  – Redundant fragments need to be recorded for Apriori pruning
[Figure: gIndex tree with levels 0, 1, 2, …; edges e1, e2, e3 and fragments f1, f2, f3]
Query: <e1, e2, e3, e4, e5>
Fragments: <e1>, <e1, e2>, <e1, e2, e3>, <e1, e2, e3, e4> (stop), <e2>, …
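The search with Apriori pruning can be sketched as a prefix-tree walk in Python. This is a simplification made for illustration: fragments are treated as contiguous runs of the query's edge sequence rather than as DFS-coded subgraphs, and the `Node` class, the tree contents, and the ID lists are invented example data.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.ids = None  # None marks a redundant (white) node without an ID list

def insert(root, seq, ids=None):
    """Add an edge sequence to the prefix tree; ids=None records a redundant fragment."""
    node = root
    for e in seq:
        node = node.children.setdefault(e, Node())
    node.ids = ids

def search(root, query, all_ids):
    """Intersect the ID lists of all indexed fragments found in the query."""
    result = set(all_ids)
    visited = []
    for start in range(len(query)):
        node = root
        for end in range(start, len(query)):
            node = node.children.get(query[end])
            if node is None:
                break  # Apriori pruning: no indexed fragment extends this one
            visited.append(tuple(query[start:end + 1]))
            if node.ids is not None:  # discriminative (black) node
                result &= node.ids
    return result, visited

root = Node()
insert(root, ["e1"], {"g1", "g2", "g3"})
insert(root, ["e1", "e2"])                      # redundant, kept only for pruning
insert(root, ["e1", "e2", "e3"], {"g2", "g3"})
insert(root, ["e2"], {"g2", "g3", "g4"})

result, visited = search(root, ["e1", "e2", "e3", "e4", "e5"], {"g1", "g2", "g3", "g4"})
print(sorted(result), visited)
```

Without the white node for <e1, e2>, the walk would stop there and never reach <e1, e2, e3>, which is why redundant fragments still have to be recorded.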
Grafil(1/4) [Yan et al., SIGMOD’05]
• Subgraph similarity search
• Feature-based approach
• Similarity search using relaxed queries
  – Relax a query by deletion of k edges
  – Missed edges incur missed features
• Main question
  – What is the maximum number of missed features when relaxing a query with k missed edges?
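[Figure: feature vectors {u1, u2, …, un} of graphs G1, G2, …, Gn compared with the query feature vector {v1, v2, …, vn}, for subgraph exact search and subgraph similarity search]
Subgraph exact search condition: $u_i \ge v_i$ for $1 \le i \le n$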
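The difference between the two filters can be illustrated with feature vectors in Python. The exact-search condition is the one stated above; the similarity condition (total feature misses bounded by a maximum) is a sketch of the idea under the assumption that misses are summed over features, and the vectors and the bound 4 are illustrative values.

```python
def passes_exact(u, v):
    """Exact search filter: the graph has at least as many occurrences of every feature."""
    return all(ui >= vi for ui, vi in zip(u, v))

def feature_misses(u, v):
    """Number of query feature occurrences that the graph lacks."""
    return sum(max(0, vi - ui) for ui, vi in zip(u, v))

def passes_similarity(u, v, max_misses):
    """Similarity filter: tolerate up to max_misses missing feature occurrences."""
    return feature_misses(u, v) <= max_misses

query = (1, 2, 4)    # feature vector v of the query
graph_a = (1, 0, 3)  # feature vector u of one graph
graph_b = (0, 0, 1)  # feature vector u of another graph

print(passes_exact(graph_a, query), passes_similarity(graph_a, query, 4))
print(feature_misses(graph_b, query), passes_similarity(graph_b, query, 4))
```

The open question is how large the allowed number of misses must be for a given number of relaxed edges.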
Grafil(2/4): Feature Misses
[Figure: a query and its relaxed queries, each missing 1 edge, with their feature counts]
                  fa  fb  fc  Total  Feature miss
Query             1   2   4   7
Relaxed query 1   1   0   3   4      7-4=3
Relaxed query 2   0   1   2   3      7-3=4
Relaxed query 3   0   1   2   3      7-3=4
Maximum feature misses: mmax = 4
Grafil(3/4): Feature Miss Estimation
• Problem
  – Given a query Q and a set of features contained in Q, if the relaxation ratio is given, what is the maximal number of features that can be missed?
• Use edge-feature matrix
  – Find the maximum number of columns that can be hit by k rows
  – k: the number of missing edges in Q
• Classic maximum coverage problem (set k-cover)
  – Proved NP-complete
[Figure: a query with edges e1, e2, e3 and its features fa, fb, fc]
Edge-Feature Matrix:
     fa  fb1  fb2  fc1  fc2  fc3  fc4
e1   0   1    1    1    0    0    0
e2   1   1    0    0    1    0    1
e3   1   0    1    0    0    1    1
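For a matrix as small as the example, the maximum number of columns hit by k rows can be found by exhaustive search, as in the Python sketch below. The matrix values are copied from the example; `max_feature_misses` is a hypothetical helper, and brute force is used only because the general problem is NP-complete and this instance is tiny.

```python
from itertools import combinations

# Edge-feature matrix: columns are fa, fb1, fb2, fc1, fc2, fc3, fc4.
MATRIX = {
    "e1": [0, 1, 1, 1, 0, 0, 0],
    "e2": [1, 1, 0, 0, 1, 0, 1],
    "e3": [1, 0, 1, 0, 0, 1, 1],
}

def max_feature_misses(matrix, k):
    """Maximum number of feature columns hit by any choice of k edge rows."""
    n_cols = len(next(iter(matrix.values())))
    best = 0
    for rows in combinations(matrix, k):
        hit = sum(1 for c in range(n_cols) if any(matrix[r][c] for r in rows))
        best = max(best, hit)
    return best

for k in (1, 2, 3):
    print(k, max_feature_misses(MATRIX, k))
```

For k = 1 this gives 4, matching the maximum feature misses in the relaxed-query example; for larger queries an approximation would be needed instead of enumeration.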
Grafil(4/4): Feature Conjugation
• The misses of one feature are compensated by occurrences of another feature in G
• Using all the features together in one filter would deteriorate the filtering performance
• Solution
  – Use multiple filters
  – Feature set selection
[Figure: a query with features fa and fb (values 3 and 4) and a graph over node labels A, B, C; relaxation ratio = 1, mmax = 4; (3-0)+0 = 3 ≤ mmax]
References
• [Shasha et al., PODS’02] Dennis Shasha, Jason T. L. Wang, Rosalba Giugno. Algorithmics and Applications of Tree and Graph Searching. PODS, 2002.
• [Yan et al., SIGMOD’04] Xifeng Yan, Philip S. Yu, Jiawei Han. Graph Indexing: A Frequent Structure-based Approach. SIGMOD, 2004.
• [Yan et al., SIGMOD’05] Xifeng Yan, Philip S. Yu, Jiawei Han. Substructure Similarity Search in Graph Databases. SIGMOD, 2005.
• [Tian and Patel, ICDE’08] Yuanyuan Tian, Jignesh M. Patel. TALE: A Tool for Approximate Large Graph Matching. ICDE, 2008.
• [He and Singh, SIGMOD’08] Huahai He, Ambuj K. Singh. Graphs-at-a-time: Query Language and Access Methods for Graph Databases. SIGMOD, 2008.
• [Zhao and Han, VLDB’10] Peixiang Zhao, Jiawei Han. On Graph Query Optimization in Large Networks. VLDB, 2010.
• [He and Singh, ICDE’06] Huahai He, Ambuj K. Singh. Closure-Tree: An Index Structure for Graph Queries. ICDE, 2006.
• [Shang et al., VLDB’08] Haichuan Shang, Ying Zhang, Xuemin Lin, Jeffrey Xu Yu. Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism. VLDB, 2008.