Graph Databases (Co so du lieu do thi)

Apr 05, 2018

Phan Duy
  • 7/31/2019 Co so du lieu do thi

    1/46

    July 22, 2010 1

    Mining, Indexing and Searching Graph Databases

    Presenter: A/Prof. Do Phuc

    Source: Jiawei Han, Vladimir Lipets

  • 7/31/2019 Co so du lieu do thi

    2/46

    Graph, Graph, Everywhere

    [Figures: Aspirin; Yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); An Internet Web; Co-author network]

  • 7/31/2019 Co so du lieu do thi

    3/46

    Why Graph Mining and Searching?

    Graphs are ubiquitous

    Chemical compounds (Cheminformatics)

    Protein structures, biological pathways/networks (Bioinformatics)

    Program control flow, traffic flow, and workflow analysis

    XML databases, Web, and social network analysis

    Graph is a general model

    Trees, lattices, sequences, and items are degenerate graphs

    Diversity of graphs

    Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)

    Complexity of algorithms: many problems are of high complexity!

  • 7/31/2019 Co so du lieu do thi

    4/46

  • 7/31/2019 Co so du lieu do thi

    5/46

    Motivation

    Graph and subgraph isomorphism are important and very general forms of pattern matching that find practical application in areas such as:

    pattern recognition and computer vision,

    image processing,

    computer-aided design,

    graph grammars,

    graph transformation,

    biocomputing,

    search operations in chemical databases,

    numerous others.

  • 7/31/2019 Co so du lieu do thi

    6/46

    A hierarchy of pattern matching problems

    Graph isomorphism

    Subgraph isomorphism

    Maximum common subgraph

    Approximate subgraph isomorphism

    Graph edit distance

  • 7/31/2019 Co so du lieu do thi

    7/46

    Isomorphic Graphs

  • 7/31/2019 Co so du lieu do thi

    8/46

    Graph Isomorphism

  • 7/31/2019 Co so du lieu do thi

    9/46

    Subgraph of a given graph

  • 7/31/2019 Co so du lieu do thi

    10/46


  • 7/31/2019 Co so du lieu do thi

    11/46

    Subgraph Isomorphism and Related Problems

    Given a pattern graph G and a target graph H

    Decision problem: Answer whether H contains a subgraph isomorphic to G

    Search problem: Return an occurrence of G as a subgraph of H

    Counting problem: Return a count of the number of subgraphs of H that are isomorphic to G

    Enumeration problem: Return all occurrences of G as a subgraph of H
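The four problem variants can be illustrated with a brute-force sketch. It assumes undirected, unlabeled graphs given as edge lists, and it treats an occurrence as any injective vertex mapping that preserves the edges of G (extra edges in H are allowed). The function name `occurrences` and the toy graphs are illustrative only and are not taken from the slides. The search is exponential, so this is only suitable for very small inputs.

```python
from itertools import permutations

def occurrences(pattern_nodes, pattern_edges, target_nodes, target_edges):
    """Yield every injective vertex mapping that sends each pattern edge to a target edge."""
    target = {frozenset(e) for e in target_edges}
    for image in permutations(target_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in target for u, v in pattern_edges):
            yield mapping

# Pattern G: a triangle. Target H: a 4-cycle with one chord (1-3).
G_nodes = ["x", "y", "z"]
G_edges = [("x", "y"), ("y", "z"), ("x", "z")]
H_nodes = [1, 2, 3, 4]
H_edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]

# Enumeration problem: all occurrences (as vertex mappings).
all_maps = list(occurrences(G_nodes, G_edges, H_nodes, H_edges))

# Decision problem: does H contain a subgraph isomorphic to G?
decision = len(all_maps) > 0

# Search problem: return one occurrence, if any.
search = all_maps[0] if all_maps else None

# Counting problem: distinct subgraphs of H, so mappings that cover
# the same edge set of H are counted once.
subgraphs = {frozenset(frozenset((m[u], m[v])) for u, v in G_edges) for m in all_maps}
count = len(subgraphs)
```

In this toy example the triangle occurs on the vertex sets {1, 2, 3} and {1, 3, 4}, and each occurrence is reached by six vertex mappings because the triangle has six automorphisms.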

  • 7/31/2019 Co so du lieu do thi

    12/46

    Outline

    Graph Isomorphism, Subgraph Isomorphism

    Mining frequent graph patterns

    Graph indexing methods

    Similarity search in graph databases

    Biological network analysis

  • 7/31/2019 Co so du lieu do thi

    13/46

    Graph Pattern Mining

    Frequent subgraphs

    A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold

    Applications of graph pattern mining

    Mining biochemical structures

    Program control flow analysis

    Mining XML structures or Web communities

    Building blocks for graph classification, clustering, comparison, and correlation analysis
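The support definition can be made concrete with a minimal sketch. It assumes the simplest possible pattern, a single edge identified by the labels of its two endpoints, and it counts each pattern at most once per graph. The dataset and the function name `frequent_edge_patterns` are hypothetical. Larger patterns would need a subgraph isomorphism test in place of the set lookup.

```python
from collections import Counter

def frequent_edge_patterns(dataset, min_support):
    """Return single-edge patterns whose support meets the threshold.

    Support of a pattern = number of graphs in the dataset that contain it.
    """
    support = Counter()
    for graph in dataset:
        # A set ensures a pattern is counted once per graph, however often it occurs.
        patterns = {tuple(sorted(edge)) for edge in graph}
        support.update(patterns)
    return {p: s for p, s in support.items() if s >= min_support}

# Hypothetical dataset: each graph is a list of edges given as (label, label) pairs.
dataset = [
    [("C", "O"), ("C", "N"), ("C", "C")],
    [("C", "N"), ("N", "O")],
    [("C", "O"), ("C", "N"), ("C", "S")],
]

frequent = frequent_edge_patterns(dataset, min_support=2)
```

With a minimum support of 2, only the C-N edge (support 3) and the C-O edge (support 2) are frequent in this toy dataset.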

  • 7/31/2019 Co so du lieu do thi

    14/46

    Example: Frequent Subgraphs

    [Figure: a graph dataset of three chemical structures (A), (B), (C), and the frequent patterns (1), (2) found with min support 2]

  • 7/31/2019 Co so du lieu do thi

    15/46

    Frequent Subgraph Mining Approaches

    Apriori-based approach

    AGM/AcGM: Inokuchi, et al. (PKDD'00)

    FSG: Kuramochi and Karypis (ICDM'01)

    PATH: Vanetik and Gudes (ICDM'02, ICDM'04)

    FFSM: Huan, et al. (ICDM'03)

    Pattern growth-based approach

    MoFa: Borgelt and Berthold (ICDM'02)

    gSpan: Yan and Han (ICDM'02)

    Gaston: Nijssen and Kok (KDD'04)

  • 7/31/2019 Co so du lieu do thi

    16/46

    Properties of Graph Mining Algorithms

    Search order: breadth vs. depth

    Generation of candidate subgraphs: apriori vs. pattern growth

    Elimination of duplicate subgraphs: passive vs. active

    Support calculation: embedding store or not

    Discover order of patterns: path, tree, graph

  • 7/31/2019 Co so du lieu do thi

    17/46

  • 7/31/2019 Co so du lieu do thi

    18/46

    Graph Search: Querying Graph Databases

    Querying graph databases:

    Given a graph database and a query graph, find all graphs containing this query graph

    [Figure: a query graph and a graph database of chemical structures]

  • 7/31/2019 Co so du lieu do thi

    19/46

    Scalability Issue

    Sequential scan

    Disk I/O

    Subgraph isomorphism testing

    An indexing mechanism is needed

    DayLight: Daylight.com (commercial)

    GraphGrep: Dennis Shasha, et al. PODS'02

    Grace: Srinath Srinivasa, et al. ICDE'03

    [Figure: sample database of three chemical structures (a), (b), (c), and a query graph]

  • 7/31/2019 Co so du lieu do thi

    20/46

    Indexing Strategy

    [Figure: graph (G), substructure, query graph (Q)]

    If graph G contains query graph Q, G should contain any substructure of Q

    Remarks

    Index substructures of a query graph to prune graphs that do not contain these substructures
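The pruning idea can be sketched with an inverted index from substructures to the graphs that contain them. This sketch assumes the indexed substructures are single labeled edges, which is a simplification chosen for brevity and not the feature set of any particular system named in the slides. The names `edge_features`, `build_index`, and `candidates` are hypothetical.

```python
def edge_features(graph):
    """Substructures used for indexing: here, just the labeled edges of the graph."""
    return {tuple(sorted(edge)) for edge in graph}

def build_index(database):
    """Map each feature to the set of graph ids that contain it."""
    index = {}
    for gid, graph in database.items():
        for feature in edge_features(graph):
            index.setdefault(feature, set()).add(gid)
    return index

def candidates(index, database, query):
    """Keep only graphs that contain every indexed substructure of the query."""
    result = set(database)
    for feature in edge_features(query):
        result &= index.get(feature, set())
    return result

# Hypothetical database: graph id -> list of edges given as (label, label) pairs.
database = {
    "a": [("C", "O"), ("C", "N")],
    "b": [("C", "N"), ("N", "N")],
    "c": [("C", "O"), ("C", "S")],
}
index = build_index(database)

one_edge = candidates(index, database, [("N", "C")])
two_edges = candidates(index, database, [("C", "O"), ("C", "N")])
no_match = candidates(index, database, [("O", "O")])
```

The surviving graphs are only candidates. Containing every feature of Q is necessary but not sufficient, so each candidate would still need a subgraph isomorphism test.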

  • 7/31/2019 Co so du lieu do thi

    21/46

  • 7/31/2019 Co so du lieu do thi

    22/46

    Outline

    Mining frequent graph patterns

    Graph indexing methods

    Similarity search in graph databases

    Biological network analysis

    Some recent progress on graph mining

  • 7/31/2019 Co so du lieu do thi

    23/46

    Graph Clustering

    Graph similarity measure

    Feature-based similarity measure

    Each graph is represented as a feature vector

    The similarity is defined by the distance of their corresponding vectors

    Frequent subgraphs can be used as features

    Structure-based similarity measure

    Maximal common subgraph

    Graph edit distance: insertion, deletion, and relabel

    Graph alignment distance

  • 7/31/2019 Co so du lieu do thi

    24/46

    Graph Classification

    Local structure based approach

    Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length

    Graph pattern-based approach

    Subgraph patterns from domain knowledge

    Subgraph patterns from data mining

    Kernel-based approach

    Random walk (Gärtner '02, Kashima et al. '02, ICML'03, Mahé et al. ICML'04)

    Optimal local assignment (Fröhlich et al. ICML'05)

  • 7/31/2019 Co so du lieu do thi

    25/46

    Structure Similarity Search

    [Figure: chemical compounds (a) caffeine, (b) diurobromine, (c) viagra, and a query graph]

  • 7/31/2019 Co so du lieu do thi

    26/46

    Some Straightforward Methods

    Method 1: Directly compute the similarity between the graphs in the DB and the query graph

    Sequential scan

    Subgraph similarity computation

    Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search

    Costly: If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs
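The figure of 1,140 is the number of ways to choose which 3 of the 20 edges are dropped, that is, C(20, 3). A short check, with variable names chosen only for illustration:

```python
from itertools import combinations
from math import comb

query_edges = range(20)   # stand-ins for the 20 edges of the query graph
missed = 3                # number of edges allowed to be missing

# Closed form: choose which edges to drop.
n_subgraph_queries = comb(len(query_edges), missed)

# The same count obtained by explicitly enumerating the dropped edge sets.
enumerated = sum(1 for _ in combinations(query_edges, missed))
```

Both values equal 1,140. The count grows quickly with the number of allowed misses, which is why Method 2 is described as costly.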

  • 7/31/2019 Co so du lieu do thi

    27/46

    Index: Precise vs. Approximate Search

    Precise Search

    Use frequent patterns as indexing features

    Select features in the database space based on their selectivity

    Build the index

    Approximate Search

    Hard to build indices covering similar subgraphs: explosive number of subgraphs in databases

    Idea: (1) keep the index structure (2) select features in the query space

  • 7/31/2019 Co so du lieu do thi

    28/46

    Substructure Similarity Measure

    Query relaxation measure

    The number of edges that can be relabeled or missed; the positions of these edges are not fixed

    [Figure: query graph]

  • 7/31/2019 Co so du lieu do thi

    29/46

    Substructure Similarity Measure

    Feature-based similarity measure

    Each graph is represented as a feature vector X = {x1, x2, ..., xn}

    The similarity is defined by the distance of their corresponding vectors

    Advantages

    Easy to index

    Fast

    Rough measure
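A minimal sketch of the feature-based measure follows. It assumes each x_i is the number of times feature i occurs in the graph, that features are single labeled edges, and that the distance is Euclidean. None of these choices is fixed by the slide, and the graphs and function names are hypothetical.

```python
from collections import Counter
import math

def feature_vector(graph, features):
    """x_i = number of occurrences of feature i in the graph."""
    counts = Counter(tuple(sorted(edge)) for edge in graph)
    return [counts[f] for f in features]

def euclidean(x, y):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Assumed feature set: three labeled-edge patterns.
features = [("C", "C"), ("C", "N"), ("C", "O")]

g1 = [("C", "C"), ("C", "C"), ("C", "N")]
g2 = [("C", "C"), ("C", "O")]

x1 = feature_vector(g1, features)
x2 = feature_vector(g2, features)
d = euclidean(x1, x2)
```

The vectors are cheap to compute and compare, which matches the "easy to index" and "fast" points. Two structurally different graphs can still share the same vector, which is one way to read "rough measure".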

  • 7/31/2019 Co so du lieu do thi

    30/46

    Query Processing Framework

    Three steps in processing approximate graph queries

    Step 1. Index Construction

    Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database
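Step 1 can be sketched as a matrix with one row per feature and one column per graph. The sketch assumes each entry holds the number of occurrences of the feature in the graph and that features are single labeled edges. Both are simplifying assumptions, and the database contents and the name `feature_graph_matrix` are hypothetical.

```python
from collections import Counter

def feature_graph_matrix(database, features):
    """Return (graph ids, matrix) where matrix[i][j] counts feature i in graph j."""
    graph_ids = sorted(database)
    counts = {
        gid: Counter(tuple(sorted(edge)) for edge in database[gid])
        for gid in graph_ids
    }
    matrix = [[counts[gid][f] for gid in graph_ids] for f in features]
    return graph_ids, matrix

# Hypothetical database: graph id -> list of edges given as (label, label) pairs.
database = {
    "G1": [("C", "C"), ("C", "N"), ("C", "O")],
    "G2": [("C", "C"), ("C", "C")],
    "G3": [("C", "N"), ("N", "O")],
}
features = [("C", "C"), ("C", "N"), ("C", "O")]

ids, M = feature_graph_matrix(database, features)
```

Here the row for the C-C feature is [1, 2, 0], meaning one occurrence in G1, two in G2, and none in G3.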

  • 7/31/2019 Co so du lieu do thi

    31/46

    Framework (cont.)

    Step 2. Feature Miss Estimation

    Determine the indexed features belonging to the query graph

    Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J

    On the query graph, not the graph database

  • 7/31/2019 Co so du lieu do thi

    32/46

    Framework (cont.)

    Step 3. Query Processing

    Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, F_G - F_Q

    If F_G - F_Q > J, discard G. The remaining graphs constitute a candidate answer set
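The filtering step can be sketched as follows. The sketch assumes one particular reading of the feature difference: the number of query feature occurrences that G fails to cover, summed over all indexed features. That reading, the toy counts, and the names `feature_misses` and `filter_candidates` are assumptions for illustration. J is taken as given rather than estimated from the query.

```python
def feature_misses(query_counts, graph_counts):
    """Number of query feature occurrences that the graph does not cover."""
    return sum(max(0, q - g) for q, g in zip(query_counts, graph_counts))

def filter_candidates(columns, query_counts, J):
    """Discard every graph whose feature misses exceed the bound J."""
    return [gid for gid, col in columns.items()
            if feature_misses(query_counts, col) <= J]

# Columns of a small feature-graph matrix: graph id -> counts per feature.
columns = {
    "G1": [1, 1, 1],
    "G2": [2, 0, 0],
    "G3": [0, 1, 0],
}
query = [1, 1, 1]  # feature counts of the query graph Q

strict = filter_candidates(columns, query, J=1)
relaxed = filter_candidates(columns, query, J=2)
```

With J = 1 only G1 survives, since G2 and G3 each miss two query features. With J = 2 all three graphs remain in the candidate answer set and would go on to verification.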

  • 7/31/2019 Co so du lieu do thi

    33/46

    Outline

    Mining frequent graph patterns

    Graph indexing methods

    Similarity search in graph databases

    Biological network analysis

  • 7/31/2019 Co so du lieu do thi

    34/46

    Biological Networks

    Protein-protein interaction network

    Metabolic network

    Transcriptional regulatory network

    Co-expression network

    Genetic interaction network

  • 7/31/2019 Co so du lieu do thi

    35/46

    Data Mining Across Multiple Networks

    [Figure: several networks over the same vertices a to k]

  • 7/31/2019 Co so du lieu do thi

    36/46

    Data Mining Across Multiple Networks

    [Figure: several networks over the same vertices a to k]

  • 7/31/2019 Co so du lieu do thi

    37/46

    Identify Frequent Co-expression Clusters across Multiple Microarray Data Sets

    Data set 1
          c1   c2   ...  cm
    g1    .1   .2   ...  .2
    g2    .4   .3   ...  .4

    Data set 2
          c1   c2   ...  cm
    g1    .8   .6   ...  .2
    g2    .2   .3   ...  .4

    Data set 3
          c1   c2   ...  cm
    g1    .9   .4   ...  .1
    g2    .7   .3   ...  .5

    Data set 4
          c1   c2   ...  cm
    g1    .2   .5   ...  .8
    g2    .7   .1   ...  .3

    ...

    [Figure: networks over the vertices a to k, shown alongside the data sets]

  • 7/31/2019 Co so du lieu do thi

    38/46

    CODENSE: Mine Coherent Dense Subgraphs

    [Figure: graphs G1 to G6 over the vertices a to i, and their summary graph]

    (1) Builds a summary graph by eliminating infrequent edges
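Step (1) can be sketched as an edge-support count over graphs that share one vertex set. The sketch assumes undirected edges and a user-chosen support threshold. The three toy graphs and the name `summary_graph` are hypothetical and do not reproduce G1 to G6 from the figure.

```python
from collections import Counter

def summary_graph(graphs, min_support):
    """Keep the edges that occur in at least min_support of the input graphs."""
    support = Counter()
    for edges in graphs:
        # frozenset makes (a, b) and (b, a) the same undirected edge,
        # and the set counts each edge once per graph.
        support.update({frozenset(edge) for edge in edges})
    return {edge for edge, s in support.items() if s >= min_support}

# Hypothetical graphs over the same vertex set {a, b, c, d}.
graphs = [
    [("a", "b"), ("b", "c"), ("c", "d")],
    [("a", "b"), ("c", "b")],
    [("a", "b"), ("a", "d")],
]

summary = summary_graph(graphs, min_support=2)
```

Edge a-b appears in all three graphs and b-c in two, so both are kept. Edges c-d and a-d appear once each and are eliminated as infrequent.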

  • 7/31/2019 Co so du lieu do thi

    39/46

    CODENSE: Mine Coherent Dense Subgraphs

    (2) Identify dense subgraphs of the summary graph

    [Figure: summary graph on the vertices a to i; Step 2 (MODES); subgraph Sub( ) on the vertices c, e, f, g, h, i]

    Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the reverse is not true.
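The second half of the observation can be illustrated with a small counterexample. The sketch assumes density is measured as 2m / (n(n-1)), the fraction of vertex pairs joined by an edge, which is a common definition but is not stated on the slide. The toy graphs and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def summary_edges(graphs, min_support):
    """Edges that occur in at least min_support of the input graphs."""
    support = Counter()
    for edges in graphs:
        support.update({frozenset(edge) for edge in edges})
    return {edge for edge, s in support.items() if s >= min_support}

def density(vertices, edges):
    """Fraction of vertex pairs joined by an edge: 2m / (n(n-1))."""
    n = len(vertices)
    if n < 2:
        return 0.0
    present = sum(1 for pair in combinations(sorted(vertices), 2)
                  if frozenset(pair) in edges)
    return 2 * present / (n * (n - 1))

# Each graph holds two of the three edges of the triangle a-b-c.
graphs = [
    [("a", "b"), ("b", "c")],
    [("b", "c"), ("a", "c")],
    [("a", "b"), ("a", "c")],
]

summary = summary_edges(graphs, min_support=2)
cluster = {"a", "b", "c"}

in_summary = density(cluster, summary)
in_each = [density(cluster, {frozenset(e) for e in g}) for g in graphs]
```

Every edge has support 2, so the summary graph contains the full triangle and the cluster has density 1.0 there. The full triangle occurs in none of the individual graphs, where the same vertices have density 2/3. A dense subgraph of the summary graph is therefore not necessarily a frequent dense subgraph.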

  • 7/31/2019 Co so du lieu do thi

    40/46

    Applying CoDense to 39 Yeast Microarray Data Sets

    Data set 1
          c1   c2   ...  cm
    g1    .1   .2   ...  .2
    g2    .4   .3   ...  .4

    Data set 2
          c1   c2   ...  cm
    g1    .8   .6   ...  .2
    g2    .2   .3   ...  .4

    Data set 3
          c1   c2   ...  cm
    g1    .9   .4   ...  .1
    g2    .7   .3   ...  .5

    Data set 4
          c1   c2   ...  cm
    g1    .2   .5   ...  .8
    g2    .7   .1   ...  .3

    [Figure: networks over the vertices a to k, shown alongside the data sets]

  • 7/31/2019 Co so du lieu do thi

    41/46

    Discovery of New Genes Based on Similar Genes

    [Figure: gene network with nodes ATP17, ATP12, MRPL38, MRPL37, MRPL39, FMC1, MRPS18, MRPL32, ACN9, MRPL51, MRP49, YDR115W, PHB1, PET100]

  • 7/31/2019 Co so du lieu do thi

    42/46

    Network of Known Similar Genes

    Brown: YDR115W, FMC1, ATP12, MRPL37, MRPS18

    GO:0019538 (protein metabolism; p-value = 0.001122)

    [Figure: gene network with nodes ATP17, ATP12, MRPL38, MRPL39, FMC1, MRPS18, MRPL32, ACN9, MRPL51, MRP49, YDR115W, PHB1, PET100]

  • 7/31/2019 Co so du lieu do thi

    43/46

    Network Involved in the New Genes

    Red: PHB1, ATP17, MRPL51, MRPL39, MRPL49, MRPL51, PET100

    GO:0006091 (generation of precursor metabolites and energy; p-value = 0.001339)

    [Figure: gene network with nodes ATP17, ATP12, MRPL38, MRPL37, MRPL39, FMC1, MRPS18, MRPL32, ACN9, MRPL51, MRP49, YDR115W, PHB1, PET100]

  • 7/31/2019 Co so du lieu do thi

    44/46

    Outline

    Mining frequent graph patterns

    Graph indexing methods

    Similarity search in graph databases

    Biological network analysis

    Conclusions

  • 7/31/2019 Co so du lieu do thi

    45/46

    Conclusions

    Graph mining has wide applications

    Frequent and closed subgraph mining methods

    gSpan and CloseGraph: pattern-growth depth-first search approach

    Graph indexing techniques:

    Frequent and discriminative subgraphs as indexing features

    Similarity search in graph databases

    Indexing and approximate matching help similar subgraph search

    Biological network analysis

    Mining coherent, dense, multiple biological networks

    Many new developments along the line of graph pattern mining

  • 7/31/2019 Co so du lieu do thi

    46/46

    Thanks and Questions