Graph Databases

Mining, Indexing and Searching Graph Databases

Presenter: A/Prof. Do Phuc

Source: Jiawei Han, Vladimir Lipets

July 22, 2010



Graph, Graph, Everywhere

[Figures: aspirin; yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); an Internet Web; co-author network]



Why Graph Mining and Searching?

Graphs are ubiquitous

Chemical compounds (Cheminformatics)

Protein structures, biological pathways/networks (Bioinformatics)

Program control flow, traffic flow, and workflow analysis

XML databases, Web, and social network analysis

Graph is a general model

Trees, lattices, sequences, and items are degenerate graphs

Diversity of graphs

Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)

Complexity of algorithms: many problems are of high complexity!



Motivation

Graph and subgraph isomorphism is an important and very general form of pattern matching that finds practical application in areas such as:

pattern recognition and computer vision,

image processing,

computer-aided design,

graph grammars,

graph transformation,

biocomputing,

search operations in chemical databases,

numerous others.


A hierarchy of pattern matching problems

Graph isomorphism

Subgraph isomorphism

Maximum common subgraph

Approximate subgraph isomorphism

Graph edit distance


Isomorphic Graphs



Graph Isomorphism



Subgraph of a given graph


Subgraph Isomorphism and Related Problems

Given a pattern graph G and a target graph H:

Decision problem: Answer whether H contains a subgraph isomorphic to G

Search problem: Return an occurrence of G as a subgraph of H

Counting problem: Return a count of the number of subgraphs of H that are isomorphic to G

Enumeration problem: Return all occurrences of G as a subgraph of H
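The four problem variants can be illustrated with a brute-force sketch. It assumes small, undirected, unlabeled graphs given as node lists and edge lists, and it treats "subgraph" as non-induced (extra edges in H are allowed). The function names and the two toy graphs are illustrative only and are not taken from the slides; practical systems rely on far more efficient matching algorithms.

```python
from itertools import permutations


def embeddings(g_nodes, g_edges, h_nodes, h_edges):
    """Yield each injective node map G -> H that sends every edge of G onto an edge of H."""
    h_edge_set = {frozenset(e) for e in h_edges}
    for image in permutations(h_nodes, len(g_nodes)):
        mapping = dict(zip(g_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in h_edge_set for u, v in g_edges):
            yield mapping


def decide(g_nodes, g_edges, h_nodes, h_edges):
    """Decision problem: does H contain a subgraph isomorphic to G?"""
    return next(embeddings(g_nodes, g_edges, h_nodes, h_edges), None) is not None


def search(g_nodes, g_edges, h_nodes, h_edges):
    """Search problem: return one occurrence (as a node map), or None."""
    return next(embeddings(g_nodes, g_edges, h_nodes, h_edges), None)


def enumerate_occurrences(g_nodes, g_edges, h_nodes, h_edges):
    """Enumeration problem: distinct subgraphs of H, each given as a set of edges."""
    return {
        frozenset(frozenset((m[u], m[v])) for u, v in g_edges)
        for m in embeddings(g_nodes, g_edges, h_nodes, h_edges)
    }


def count(g_nodes, g_edges, h_nodes, h_edges):
    """Counting problem: number of distinct subgraphs of H isomorphic to G."""
    return len(enumerate_occurrences(g_nodes, g_edges, h_nodes, h_edges))


# Toy graphs (assumed for illustration): a 2-edge path and a triangle.
PATH = (["a", "b", "c"], [("a", "b"), ("b", "c")])
TRIANGLE = ([1, 2, 3], [(1, 2), (2, 3), (1, 3)])

print(decide(*PATH, *TRIANGLE))
print(search(*PATH, *TRIANGLE))
print(count(*PATH, *TRIANGLE))
```

The triangle contains three distinct 2-edge paths, although six node maps realize them. This is why the sketch deduplicates occurrences by their edge set before counting.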



Outline

Graph Isomorphism, Subgraph Isomorphism

Mining frequent graph patterns

Graph indexing methods

Similarity search in graph databases

Biological network analysis


Graph Pattern Mining

Frequent subgraphs

A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold

Applications of graph pattern mining

Mining biochemical structures

Program control flow analysis

Mining XML structures or Web communities

Building blocks for graph classification, clustering, comparison, and correlation analysis



Example: Frequent Subgraphs

[Figure: a graph dataset of three chemical structures (A), (B), (C), and the two frequent patterns (1), (2) found with a minimum support of 2]
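The support test behind this example can be sketched as follows. The sketch assumes node-labeled, undirected graphs stored as a label dictionary plus an edge list, and it uses a brute-force containment check. The three toy graphs are hypothetical and do not reproduce the molecules in the figure.

```python
from itertools import permutations


def occurs_in(pattern, graph):
    """True if the labeled pattern is a (non-induced) subgraph of the labeled graph."""
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    g_edge_set = {frozenset(e) for e in g_edges}
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        labels_match = all(p_labels[n] == g_labels[m[n]] for n in p_nodes)
        edges_match = all(frozenset((m[u], m[v])) in g_edge_set for u, v in p_edges)
        if labels_match and edges_match:
            return True
    return False


def support(pattern, dataset):
    """Number of graphs in the dataset that contain the pattern at least once."""
    return sum(occurs_in(pattern, g) for g in dataset)


def is_frequent(pattern, dataset, min_support):
    return support(pattern, dataset) >= min_support


# Hypothetical dataset: (node -> label, edge list) per graph.
DATASET = [
    ({1: "C", 2: "N", 3: "O"}, [(1, 2), (1, 3)]),
    ({1: "C", 2: "N", 3: "S"}, [(1, 2), (2, 3)]),
    ({1: "C", 2: "O"}, [(1, 2)]),
]
C_N = ({"x": "C", "y": "N"}, [("x", "y")])
N_S = ({"x": "N", "y": "S"}, [("x", "y")])

print(support(C_N, DATASET), is_frequent(C_N, DATASET, min_support=2))
print(support(N_S, DATASET), is_frequent(N_S, DATASET, min_support=2))
```

Support here counts graphs, not occurrences: a pattern that appears several times inside one graph still contributes only 1.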



Frequent Subgraph Mining Approaches

Apriori-based approach

AGM/AcGM: Inokuchi, et al. (PKDD'00)

FSG: Kuramochi and Karypis (ICDM'01)

PATH: Vanetik and Gudes (ICDM'02, ICDM'04)

FFSM: Huan, et al. (ICDM'03)

Pattern growth-based approach

MoFa: Borgelt and Berthold (ICDM'02)

gSpan: Yan and Han (ICDM'02)

Gaston: Nijssen and Kok (KDD'04)


Properties of Graph Mining Algorithms

Search order: breadth vs. depth

Generation of candidate subgraphs: apriori vs. pattern growth

Elimination of duplicate subgraphs: passive vs. active

Support calculation: embedding store or not

Discover order of patterns: path -> tree -> graph


Graph Search: Querying Graph Databases

Querying graph databases:

Given a graph database and a query graph, find all graphs containing this query graph

[Figure: a query graph and a graph database of chemical structures]


Scalability Issue

Sequential scan

Disk I/O

Subgraph isomorphism testing

An indexing mechanism is needed

DayLight: Daylight.com (commercial)

GraphGrep: Dennis Shasha, et al. PODS'02

Grace: Srinath Srinivasa, et al. ICDE'03

[Figure: sample database of three chemical structures (a), (b), (c), and a query graph]


Indexing Strategy

[Figure: a graph (G), a query graph (Q), and a substructure of Q]

If graph G contains query graph Q, G should contain any substructure of Q

Remarks

Index substructures of a query graph to prune graphs that do not contain these substructures
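The pruning rule above can be sketched with an inverted index. This sketch assumes the simplest possible substructure as a feature, namely a single edge identified by its two node labels. The indexing methods named in these slides use richer features such as paths or frequent subgraphs, so the feature choice, function names, and toy database here are all illustrative assumptions.

```python
def edge_features(labels, edges):
    """Set of label pairs, one per edge, used here as the indexed substructures."""
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}


def build_index(database):
    """Inverted index: feature -> ids of the graphs that contain it."""
    index = {}
    for gid, (labels, edges) in database.items():
        for feature in edge_features(labels, edges):
            index.setdefault(feature, set()).add(gid)
    return index


def candidates(index, all_ids, query):
    """Keep only graphs that contain every indexed feature of the query."""
    result = set(all_ids)
    for feature in edge_features(*query):
        result &= index.get(feature, set())
    return result


# Hypothetical database: id -> (node -> label, edge list).
DATABASE = {
    "a": ({1: "C", 2: "N", 3: "O"}, [(1, 2), (2, 3)]),
    "b": ({1: "C", 2: "N"}, [(1, 2)]),
    "c": ({1: "C", 2: "O"}, [(1, 2)]),
}
INDEX = build_index(DATABASE)
QUERY = ({1: "C", 2: "N", 3: "O"}, [(1, 2), (2, 3)])

print(candidates(INDEX, DATABASE, QUERY))
```

The surviving graphs are only candidates. A subgraph isomorphism test is still needed to confirm each one, but it now runs on a pruned set instead of the whole database.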



Outline





Some recent progress on graph mining


Graph Clustering

Graph similarity measure

Feature-based similarity measure

Each graph is represented as a feature vector

The similarity is defined by the distance of their corresponding vectors

Frequent subgraphs can be used as features

Structure-based similarity measure

Maximal common subgraph

Graph edit distance: insertion, deletion, and relabel

Graph alignment distance


Graph Classification

Local structure based approach

Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length

Graph pattern-based approach

Subgraph patterns from domain knowledge

Subgraph patterns from data mining

Kernel-based approach

Random walk (Gärtner '02, Kashima et al. '02, ICML'03, Mahé et al. ICML'04)

Optimal local assignment (Fröhlich et al. ICML'05)


Structure Similarity Search

[Figure: chemical compounds (a) caffeine, (b) diurobromine, (c) viagra, and a query graph]


Some Straightforward Methods

Method 1: Directly compute the similarity between the graphs in the DB and the query graph

Sequential scan

Subgraph similarity computation

Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search

Costly: If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs
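The figure of 1,140 is the number of ways to choose which 3 of the 20 edges to drop, C(20, 3) = (20 × 19 × 18) / 6. A short sketch, assuming a hypothetical 20-edge path as the query and ignoring whether each relaxed query stays connected, reproduces the count by enumeration:

```python
from itertools import combinations
from math import comb


def relaxed_queries(edges, missed):
    """Yield each edge list obtained by dropping exactly `missed` edges from the query."""
    for dropped in combinations(edges, missed):
        dropped = set(dropped)
        yield [e for e in edges if e not in dropped]


# Hypothetical 20-edge query: a simple path 0-1-2-...-20.
QUERY_EDGES = [(i, i + 1) for i in range(20)]

generated = sum(1 for _ in relaxed_queries(QUERY_EDGES, 3))
print(generated, comb(20, 3))
```

Each of these relaxed queries would need its own exact subgraph search, which is what makes Method 2 costly.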


Index: Precise vs. Approximate Search

Precise Search

Use frequent patterns as indexing features

Select features in the database space based on their selectivity

Build the index

Approximate Search

Hard to build indices covering similar subgraphs: explosive number of subgraphs in databases

Idea: (1) keep the index structure, (2) select features in the query space


Substructure Similarity Measure

Query relaxation measure

The number of edges that can be relabeled or missed; the positions of these edges are not fixed

[Figure: query graph]


Substructure Similarity Measure

Feature-based similarity measure

Each graph is represented as a feature vector X = {x1, x2, ..., xn}

The similarity is defined by the distance of their corresponding vectors

Advantages

Easy to index

Fast

Disadvantage

Rough measure
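A minimal sketch of the feature-based measure, assuming each graph has already been reduced to occurrence counts of a fixed feature list and that Euclidean distance is the chosen vector distance. The slides do not fix a particular distance, and the feature names and counts below are made up for illustration.

```python
import math


def feature_vector(counts, features):
    """X = (x1, ..., xn): occurrence count of each feature, 0 if absent."""
    return [counts.get(f, 0) for f in features]


def distance(counts_a, counts_b, features):
    """Euclidean distance between the two feature vectors."""
    return math.dist(feature_vector(counts_a, features), feature_vector(counts_b, features))


# Hypothetical features and per-graph occurrence counts.
FEATURES = ["C-N", "C-O", "ring6"]
GRAPH_A = {"C-N": 2, "C-O": 1}
GRAPH_B = {"C-N": 2, "C-O": 1}  # may differ structurally from GRAPH_A
GRAPH_C = {"C-O": 1, "ring6": 1}

print(distance(GRAPH_A, GRAPH_B, FEATURES))
print(distance(GRAPH_A, GRAPH_C, FEATURES))
```

Two structurally different graphs with the same counts get distance 0, which illustrates why the measure is fast and easy to index but rough.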


Query Processing Framework

Three steps in processing approximate graph queries

Step 1. Index Construction

Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database


Framework (cont.)

Step 2. Feature Miss Estimation

Determine the indexed features belonging to the query graph

Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J

On the query graph, not the graph database
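One way to picture the bound J: if up to k query edges may be missed, J is the largest number of feature occurrences in the query that any choice of k edges can destroy. The sketch below computes this exactly by brute force, under the assumption that each feature occurrence is given as the set of query edges it uses. The exhaustive search is a stand-in chosen for clarity, not the estimation procedure of the presented method, and the toy query is hypothetical.

```python
from itertools import combinations


def max_feature_misses(query_edges, occurrences, k):
    """Largest number of feature occurrences hit by removing any k query edges."""
    k = min(k, len(query_edges))
    best = 0
    for removed in combinations(query_edges, k):
        removed = set(removed)
        # An occurrence is missed as soon as one of its edges is removed.
        hit = sum(1 for occ in occurrences if removed & occ)
        best = max(best, hit)
    return best


# Hypothetical query: the path 0-1-2-3-4 with four edges.
E = [(0, 1), (1, 2), (2, 3), (3, 4)]
# Feature occurrences in the query: every single edge and every 2-edge path.
OCCURRENCES = [frozenset([e]) for e in E] + [
    frozenset([E[i], E[i + 1]]) for i in range(len(E) - 1)
]

print(max_feature_misses(E, OCCURRENCES, 1))
print(max_feature_misses(E, OCCURRENCES, 2))
```

The computation touches only the query graph, which matches the remark on the slide: J is fixed once per query and then reused against every database graph.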


Framework (cont.)

Step 3. Query Processing

Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, F_G - F_Q

If F_G - F_Q > J, discard G. The remaining graphs constitute a candidate answer set
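The filtering step can be sketched as follows, assuming the feature-graph matrix is stored as a dictionary of per-graph feature counts and reading the "difference" as the number of query feature occurrences that the graph lacks. That reading, the names, and the toy matrix are assumptions made for this example.

```python
def feature_misses(query_counts, graph_counts):
    """Number of query feature occurrences that are absent from the graph."""
    return sum(max(0, q - graph_counts.get(f, 0)) for f, q in query_counts.items())


def filter_candidates(feature_graph_matrix, query_counts, J):
    """Discard every graph whose feature misses exceed J; keep the rest as candidates."""
    return [
        gid
        for gid, graph_counts in feature_graph_matrix.items()
        if feature_misses(query_counts, graph_counts) <= J
    ]


# Hypothetical feature-graph matrix: graph id -> (feature -> occurrence count).
MATRIX = {
    "g1": {"f1": 2, "f2": 1},
    "g2": {"f1": 0, "f2": 1},
    "g3": {"f1": 1},
}
QUERY_COUNTS = {"f1": 2, "f2": 1}

print(filter_candidates(MATRIX, QUERY_COUNTS, J=1))
print(filter_candidates(MATRIX, QUERY_COUNTS, J=2))
```

The result is a candidate set, not the final answer: each remaining graph still has to be verified with a similarity computation against the query.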


Outline





Biological network analysis


Biological Networks

Protein-protein interaction network

Metabolic network

Transcriptional regulatory network

Co-expression network

Genetic interaction network


Data Mining Across Multiple Networks

[Figure: six networks over the same set of nodes a to k]


Data Mining Across Multiple Networks

[Figure: the six networks over nodes a to k, shown again]

Identify Frequent Co-expression Clusters across Multiple Microarray Data Sets

[Figure: several microarray data sets, each an expression matrix of genes g1, g2, ... under conditions c1, c2, ..., cm, and the co-expression networks over nodes a to k derived from them]

CODENSE: Mine Coherent Dense Subgraphs

[Figure: six graphs G1 to G6 over the nodes a to i, and the summary graph built from them]

(1) Builds a summary graph by eliminating infrequent edges



(2) Identify dense subgraphs of the summary graph

[Figure: Step 2, MODES, extracts a dense subgraph Sub( ) from the summary graph]

Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the reverse is not true.
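The two steps and the observation can be illustrated with a small sketch. It assumes every input graph shares one node set and is given as an edge list, that an edge is "frequent" when it occurs in at least a chosen number of graphs, and that density means the fraction of possible node pairs that are connected. The threshold and the toy graphs are hypothetical, and the dense-subgraph search itself (MODES in the slides) is not reproduced here.

```python
from collections import Counter


def summary_graph(graphs, min_support):
    """Step 1: keep only the edges that occur in at least `min_support` graphs."""
    counts = Counter()
    for edges in graphs:
        counts.update({frozenset(e) for e in edges})
    return {e for e, c in counts.items() if c >= min_support}


def density(nodes, edges):
    """Fraction of node pairs within `nodes` that are joined by an edge in `edges`."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    inside = sum(1 for e in edges if e <= nodes)
    return 2 * inside / (n * (n - 1))


# Hypothetical input: each graph holds two of the three triangle edges.
GRAPHS = [
    [("a", "b"), ("a", "c")],
    [("a", "b"), ("b", "c")],
    [("b", "c"), ("a", "c")],
]
SUMMARY = summary_graph(GRAPHS, min_support=2)

print(density({"a", "b", "c"}, SUMMARY))
print([density({"a", "b", "c"}, {frozenset(e) for e in g}) for g in GRAPHS])
```

Here the triangle a, b, c is fully dense in the summary graph, yet no single input graph contains all three edges. This is the case the observation warns about: a dense subgraph of the summary graph is only a candidate and must be checked against the original graphs.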


Applying CoDense to 39 Yeast Microarray Data Sets

[Figure: microarray data sets, each an expression matrix of genes g1, g2, ... under conditions c1, c2, ..., cm, and the co-expression networks over nodes a to k derived from them]

Discovery of New Genes Based on Similar Genes

[Figure: network of the genes ATP17, ATP12, MRPL38, MRPL37, MRPL39, FMC1, MRPS18, MRPL32, ACN9, MRPL51, MRP49, YDR115W, PHB1, PET100]


Network of Known Similar Genes

Brown: YDR115W, FMC1, ATP12, MRPL37, MRPS18

GO:0019538 (protein metabolism; p-value = 0.001122)

[Figure: gene network]


Network Involved in the New Genes

Red: PHB1, ATP17, MRPL51, MRPL39, MRPL49, MRPL51, PET100

GO:0006091 (generation of precursor metabolites and energy; p-value = 0.001339)

[Figure: gene network]



Outline





Conclusions



Conclusions

Graph mining has wide applications

Frequent and closed subgraph mining methods

gSpan and CloseGraph: pattern-growth depth-first search approach

Graph indexing techniques:

Frequent and discriminative subgraphs as indexing features

Similarity search in graph databases

Indexing and approximate matching help similar subgraph search


Mining coherent, dense, multiple biological networks

Many new developments along the line of graph pattern mining


Thanks and Questions
