7/31/2019 Co so du lieu do thi
1/46
July 22, 2010 1
Mining, Indexing and Searching
Graph Databases
Presenter: A/Prof. Do Phuc
Source: Jiawei Han, Vladimir Lipets
Graph, Graph, Everywhere
[Figure: example graphs: Aspirin; yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); an Internet Web; co-author network]
Why Graph Mining and Searching?
Graphs are ubiquitous
Chemical compounds (Cheminformatics)
Protein structures, biological pathways/networks (Bioinformatics)
Program control flow, traffic flow, and workflow analysis
XML databases, Web, and social network analysis
Graph is a general model
Trees, lattices, sequences, and items are degenerate graphs
Diversity of graphs
Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)
Complexity of algorithms: many problems are of high complexity!
Motivation
Graph and subgraph isomorphism are important and very general forms of pattern matching that find practical application in areas such as:
pattern recognition and computer vision,
image processing,
computer-aided design,
graph grammars,
graph transformation,
biocomputing,
search operations in chemical databases,
numerous others.
A hierarchy of pattern matching problems
Graph isomorphism
Subgraph isomorphism
Maximum common subgraph
Approximate subgraph isomorphism
Graph edit distance
Isomorphic Graphs
Graph Isomorphism
Subgraph of a given graph
Subgraph Isomorphism and Related Problems
Given a pattern graph G and a target graph H
Decision problem: Answer whether H contains a subgraph isomorphic to G
Search problem: Return an occurrence of G as a subgraph of H
Counting problem: Return a count of the number of subgraphs of H that are isomorphic to G
Enumeration problem: Return all occurrences of G as a subgraph of H
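The four problem variants above can be illustrated with a brute-force sketch. It is not an algorithm from the slides: it assumes small, undirected, unlabeled graphs given as node and edge lists, treats "subgraph" as not necessarily induced, and the helper name `embeddings` is hypothetical. The search is exponential and only meant to make the definitions concrete.

```python
from itertools import permutations

def embeddings(pattern_nodes, pattern_edges, target_nodes, target_edges):
    """Yield every injective mapping of pattern nodes to target nodes that
    sends each pattern edge onto a target edge (undirected, unlabeled)."""
    target = {frozenset(e) for e in target_edges}
    for image in permutations(target_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in target
               for u, v in pattern_edges):
            yield mapping

# Pattern G: a triangle. Target H: a square a-b-c-d with the diagonal a-c.
G_nodes, G_edges = [0, 1, 2], [(0, 1), (1, 2), (0, 2)]
H_nodes = ["a", "b", "c", "d"]
H_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]

# Enumeration problem: all occurrences, here as node mappings.
all_maps = list(embeddings(G_nodes, G_edges, H_nodes, H_edges))
# Decision problem: does H contain a subgraph isomorphic to G?
decision = len(all_maps) > 0
# Search problem: return one occurrence, if any.
search = all_maps[0] if all_maps else None
# Counting problem: mappings that cover the same edge set of H are the
# same subgraph, so collapse them before counting.
occurrences = {frozenset(frozenset((m[u], m[v])) for u, v in G_edges)
               for m in all_maps}
count = len(occurrences)
```

Note that the number of mappings and the number of distinct subgraphs differ: each triangle in H is hit once per automorphism of the pattern.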
Outline
Graph Isomorphism, Subgraph Isomorphism
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Graph Pattern Mining
Frequent subgraphs
A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of graph pattern mining
Mining biochemical structures
Program control flow analysis
Mining XML structures or Web communities
Building blocks for graph classification, clustering, comparison, and correlation analysis
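The support definition above can be sketched for the smallest possible patterns, single labeled edges. This is an illustrative assumption, not one of the mining algorithms named in these slides: graphs are given as a vertex-label dictionary plus an edge list, and the names `edge_patterns` and `frequent_edges` are hypothetical.

```python
from collections import Counter

def edge_patterns(graph):
    """Return the set of labeled-edge patterns occurring in one graph.
    graph = (labels, edges): labels maps vertex -> label, edges is a list of pairs."""
    labels, edges = graph
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}

def frequent_edges(dataset, min_support):
    """Support of a pattern = number of graphs in the dataset that contain it."""
    support = Counter()
    for graph in dataset:
        # A set is used so each graph counts a pattern at most once.
        support.update(edge_patterns(graph))
    return {p: s for p, s in support.items() if s >= min_support}

dataset = [
    ({1: "C", 2: "N", 3: "O"}, [(1, 2), (1, 3)]),   # C-N, C-O
    ({1: "C", 2: "N", 3: "S"}, [(1, 2), (2, 3)]),   # C-N, N-S
    ({1: "C", 2: "O", 3: "O"}, [(1, 2), (1, 3)]),   # C-O, C-O
]
frequent = frequent_edges(dataset, min_support=2)
```

With a minimum support of 2, the C-N and C-O edges are frequent and the N-S edge is not. Larger patterns need a subgraph isomorphism test in place of the simple set lookup.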
Example: Frequent Subgraphs
[Figure: a graph dataset of three chemical compounds (A), (B), (C) and the two frequent patterns (1), (2) found with min support 2]
Frequent Subgraph Mining Approaches
Apriori-based approach
AGM/AcGM: Inokuchi, et al. (PKDD'00)
FSG: Kuramochi and Karypis (ICDM'01)
PATH: Vanetik and Gudes (ICDM'02, ICDM'04)
FFSM: Huan, et al. (ICDM'03)
Pattern growth-based approach
MoFa: Borgelt and Berthold (ICDM'02)
gSpan: Yan and Han (ICDM'02)
Gaston: Nijssen and Kok (KDD'04)
Properties of Graph Mining Algorithms
Search order: breadth vs. depth
Generation of candidate subgraphs: apriori vs. pattern growth
Elimination of duplicate subgraphs: passive vs. active
Support calculation: embedding store or not
Discover order of patterns: path, tree, graph
Graph Search: Querying Graph Databases
Querying graph databases:
Given a graph database and a query graph, find all graphs containing this query graph
[Figure: a query graph and a graph database of chemical compounds]
Scalability Issue
Sequential scan
Disk I/O
Subgraph isomorphism testing
An indexing mechanism is needed
DayLight: Daylight.com (commercial)
GraphGrep: Dennis Shasha, et al. PODS'02
Grace: Srinath Srinivasa, et al. ICDE'03
[Figure: a sample database of three compounds (a), (b), (c) and a query graph]
Indexing Strategy
[Figure: a graph (G), a query graph (Q), and a substructure of Q]
If graph G contains query graph Q, G should contain any substructure of Q
Remarks
Index substructures of a query graph to prune graphs that do not contain these substructures
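The pruning rule above can be sketched as a small inverted index. This is an illustration under stated assumptions, not the index of any system named in these slides: the only indexed substructures are single labeled edges, graphs are `(labels, edges)` pairs, and the names `features`, `build_index`, and `candidates` are hypothetical.

```python
from collections import defaultdict

def features(graph):
    """Indexed substructures of a graph; here simply its labeled edges."""
    labels, edges = graph
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}

def build_index(database):
    """Map each feature to the ids of the database graphs that contain it."""
    index = defaultdict(set)
    for gid, graph in database.items():
        for f in features(graph):
            index[f].add(gid)
    return index

def candidates(index, database, query):
    """Keep only graphs containing every indexed substructure of the query."""
    result = set(database)
    for f in features(query):
        result &= index.get(f, set())
    return result

database = {
    "a": ({1: "C", 2: "N", 3: "O"}, [(1, 2), (2, 3)]),   # C-N, N-O
    "b": ({1: "C", 2: "N", 3: "S"}, [(1, 2), (1, 3)]),   # C-N, C-S
    "c": ({1: "C", 2: "O", 3: "N"}, [(1, 2), (2, 3)]),   # C-O, N-O
}
query = ({1: "C", 2: "N", 3: "O"}, [(1, 2), (2, 3)])     # path C-N-O
index = build_index(database)
survivors = candidates(index, database, query)
```

Graphs "b" and "c" are pruned without any isomorphism test. The survivors are only candidates: each one still has to be verified with a subgraph isomorphism test, because containing all features of Q does not guarantee containing Q.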
Outline
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Some recent progress on graph mining
Graph Clustering
Graph similarity measure
Feature-based similarity measure
Each graph is represented as a feature vector
The similarity is defined by the distance of their corresponding vectors
Frequent subgraphs can be used as features
Structure-based similarity measure
Maximal common subgraph
Graph edit distance: insertion, deletion, and relabel
Graph alignment distance
Graph Classification
Local structure based approach
Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length
Graph pattern-based approach
Subgraph patterns from domain knowledge
Subgraph patterns from data mining
Kernel-based approach
Random walk (Gärtner 02, Kashima et al. 02, ICML03, Mahé et al. ICML04)
Optimal local assignment (Fröhlich et al. ICML05)
Structure Similarity Search
[Figure: chemical compounds (a) caffeine, (b) diurobromine, (c) viagra, and a query graph]
Some Straightforward Methods
Method 1: Directly compute the similarity between the graphs in the DB and the query graph
Sequential scan
Subgraph similarity computation
Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search
Costly: If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs
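The figure of 1,140 is the number of ways to choose which 3 of the 20 edges to drop, C(20, 3) = 20 · 19 · 18 / 6. A short check using only the standard library (the variable names are illustrative):

```python
from math import comb

query_edges = 20
missed_edges = 3

# Ways to pick exactly which 3 edges are missing from a 20-edge query.
exactly_three = comb(query_edges, missed_edges)

# If up to 3 edges may be missing, the choices for 0, 1, 2 and 3 add up.
at_most_three = sum(comb(query_edges, k) for k in range(missed_edges + 1))
```

The count grows quickly with the number of allowed misses, which is why the slide calls this method costly.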
Index: Precise vs. Approximate Search
Precise Search
Use frequent patterns as indexing features
Select features in the database space based on their selectivity
Build the index
Approximate Search
Hard to build indices covering similar subgraphs: explosive number of subgraphs in databases
Idea: (1) keep the index structure (2) select features in the query space
Substructure Similarity Measure
Query relaxation measure
The number of edges that can be relabeled or missed; but the positions of these edges are not fixed
[Figure: a query graph]
Substructure Similarity Measure
Feature-based similarity measure
Each graph is represented as a feature vector X = {x1, x2, ..., xn}
The similarity is defined by the distance of their corresponding vectors
Advantages
Easy to index
Fast
Rough measure
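A minimal sketch of the feature-based measure, under assumptions that are not in the slides: the features are labeled edges, each vector entry counts occurrences of one feature, and the distance is the L1 (Manhattan) distance. The names `feature_vector` and `l1_distance` are hypothetical.

```python
from collections import Counter

def feature_vector(graph, feature_list):
    """Count how often each feature (here a labeled edge) occurs in the graph."""
    labels, edges = graph
    counts = Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)
    return [counts[f] for f in feature_list]

def l1_distance(x, y):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

feature_list = [("C", "N"), ("C", "O"), ("N", "O")]
g1 = ({1: "C", 2: "N", 3: "O", 4: "C"}, [(1, 2), (2, 3), (2, 4)])
g2 = ({1: "C", 2: "N", 3: "O"}, [(1, 2), (1, 3)])

x1 = feature_vector(g1, feature_list)
x2 = feature_vector(g2, feature_list)
distance = l1_distance(x1, x2)
```

The vectors are cheap to store and compare, which is what makes the measure easy to index. It is rough because structurally different graphs can share the same vector.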
Query Processing Framework
Three steps in processing approximate graph queries
Step 1. Index Construction
Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database
Framework (cont.)
Step 2. Feature Miss Estimation
Determine the indexed features belonging to the query graph
Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J
On the query graph, not the graph database
Framework (cont.)
Step 3. Query Processing
Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, F_G - F_Q
If F_G - F_Q > J, discard G. The remaining graphs constitute a candidate answer set
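The filtering step can be sketched as follows. The slide does not spell out how the feature difference is computed, so this sketch assumes one plausible reading: count, per feature, how many occurrences required by the query are absent from the graph. The matrix layout, the example numbers, and the names `feature_misses` and `filter_candidates` are all hypothetical.

```python
def feature_misses(graph_row, query_row):
    """Assumed difference measure: query feature occurrences missing from G.
    Extra occurrences in G are not penalized."""
    return sum(max(0, q - g) for g, q in zip(graph_row, query_row))

def filter_candidates(matrix, query_row, max_misses):
    """Discard every graph whose miss count exceeds the bound J (max_misses)."""
    return [gid for gid, row in matrix.items()
            if feature_misses(row, query_row) <= max_misses]

# Feature-graph matrix: one row of feature counts per database graph.
matrix = {
    "G1": [2, 1, 1],
    "G2": [0, 1, 0],
    "G3": [1, 0, 1],
}
query_row = [2, 1, 1]   # feature counts of the query graph Q
J = 2                   # upper bound on missed features from Step 2

candidate_set = filter_candidates(matrix, query_row, J)
```

Here G2 misses three feature occurrences and is discarded, while G1 and G3 stay in the candidate answer set. The candidates still need an exact similarity check, since the filter only rules graphs out.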
Outline
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Biological Networks
Protein-protein interaction network
Metabolic network
Transcriptional regulatory network
Co-expression network
Genetic interaction network
Data Mining Across Multiple Networks
[Figure: six networks over the same vertices a to k, shown on two consecutive slides]
Identify Frequent Co-expression Clusters across Multiple Microarray Data Sets
[Figure: microarray data sets, each a matrix of expression values for genes g1, g2, ... under conditions c1, c2, ..., cm, shown alongside networks over vertices a to k]
CODENSE: Mine Coherent Dense Subgraphs
(1) Builds a summary graph by eliminating infrequent edges
[Figure: six input graphs G1 to G6 over vertices a to i and the resulting summary graph]
(2) Identify dense subgraphs of the summary graph
[Figure: the summary graph and one of its dense subgraphs (vertices c, e, f, g, h, i), found in Step 2 with MODES]
Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the reverse is not true.
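The summary-graph step and the observation can be made concrete with a small sketch. It is not the CODENSE implementation: it assumes undirected graphs given as edge lists, measures density as the fraction of possible edges present, and uses hypothetical names (`summary_graph`, `density`) and a made-up three-graph example.

```python
from collections import Counter

def summary_graph(graphs, min_count):
    """Keep an edge only if it occurs in at least min_count input graphs."""
    counts = Counter()
    for edges in graphs:
        counts.update({frozenset(e) for e in edges})
    return {e for e, c in counts.items() if c >= min_count}

def density(nodes, edges):
    """Fraction of possible edges among `nodes` that are present in `edges`."""
    nodes = set(nodes)
    inside = sum(1 for e in edges if frozenset(e) <= nodes)
    possible = len(nodes) * (len(nodes) - 1) // 2
    return inside / possible if possible else 0.0

# Three graphs on vertices a, b, c; each contains only two of the three edges.
graphs = [
    [("a", "b"), ("b", "c")],
    [("b", "c"), ("a", "c")],
    [("a", "b"), ("a", "c")],
]
summary = summary_graph(graphs, min_count=2)

dense_in_summary = density("abc", summary)
dense_in_each = [density("abc", g) for g in graphs]
```

Every edge occurs in two graphs, so the summary graph is a full triangle with density 1.0, yet no single input graph contains that triangle. This is the "reverse is not true" case: a dense subgraph of the summary graph need not be a frequent dense subgraph of the original graphs.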
Applying CoDense to 39 Yeast Microarray Data Sets
[Figure: microarray matrices (genes g1, g2 under conditions c1, c2, ..., cm) and networks over vertices a to k]
Discovery of New Genes Based on Similar Genes
[Figure: gene network with nodes ATP17, ATP12, MRPL38, MRPL37, MRPL39, FMC1, MRPS18, MRPL32, ACN9, MRPL51, MRP49, YDR115W, PHB1, PET100]
Network of Known Similar Genes
Brown: YDR115W, FMC1, ATP12, MRPL37, MRPS18
GO:0019538 (protein metabolism; p-value = 0.001122)
[Figure: gene network with nodes ATP17, ATP12, MRPL38, MRPL39, FMC1, MRPS18, MRPL32, ACN9, MRPL51, MRP49, YDR115W, PHB1, PET100]
Network Involved in the New Genes
Red: PHB1, ATP17, MRPL51, MRPL39, MRPL49, MRPL51, PET100
GO:0006091 (generation of precursor metabolites and energy; p-value = 0.001339)
[Figure: gene network with nodes ATP17, ATP12, MRPL38, MRPL37, MRPL39, FMC1, MRPS18, MRPL32, ACN9, MRPL51, MRP49, YDR115W, PHB1, PET100]
Outline
Mining frequent graph patterns
Graph indexing methods
Similarity search in graph databases
Biological network analysis
Conclusions
Conclusions
Graph mining has wide applications
Frequent and closed subgraph mining methods
gSpan and CloseGraph: pattern-growth depth-first search approach
Graph indexing techniques:
Frequent and discriminative subgraphs as indexing features
Similarity search in graph databases
Indexing and approximate matching help similar subgraph search
Biological network analysis
Mining coherent, dense, multiple biological networks
Many new developments along the line of graph pattern mining
Thanks and Questions