Mining Significant Graph Patterns by Leap Search

Xifeng Yan (IBM T. J. Watson), Hong Cheng and Jiawei Han (UIUC), Philip S. Yu (UIC)

© 2008 IBM Corporation

IBM T. J. Watson Research Center

Graph Pattern Mining | © 2008 IBM Corporation

Graph Patterns

Interestingness measures / Objective functions

• Frequency: frequent graph pattern

• Discriminative: information gain, Fisher score

• Significance: G-test

• …
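As an illustration of the significance measure listed above, here is a minimal sketch of a G-test score computed from a pattern's frequency pair. The names `g_test`, `p`, `q`, `m`, and the clamping constant `EPS` are assumptions made for this example and do not appear in the slides. The formula is the standard goodness-of-fit G statistic applied to a two-outcome table (graph contains the pattern or not), with the negative-set frequency treated as the expected rate.

```python
import math

EPS = 1e-3  # assumed smoothing floor so the logarithms stay finite


def _clamp(x: float) -> float:
    """Keep a frequency inside [EPS, 1 - EPS]."""
    return min(max(x, EPS), 1.0 - EPS)


def g_test(p: float, q: float, m: int) -> float:
    """G-test score for a pattern.

    p: fraction of positive graphs containing the pattern
    q: fraction of negative graphs containing the pattern
    m: number of positive graphs
    """
    p, q = _clamp(p), _clamp(q)
    return 2.0 * m * (
        p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    )


if __name__ == "__main__":
    # A pattern equally frequent in both sets carries no signal.
    print(g_test(0.5, 0.5, 100))
    # A larger gap between p and q gives a larger score.
    print(g_test(0.6, 0.4, 100), g_test(0.8, 0.2, 100))
```

The score is zero when `p == q` and grows as the two frequencies move apart in either direction, which is why it is not anti-monotonic in the pattern size.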

Frequent Graph Pattern

Optimal Graph Pattern (this work)

Objective Functions

Challenge: Not Anti-Monotonic

Challenge: Not Anti-Monotonic

[Figure panels: Anti-Monotonic vs. Non-Monotonic]

Non-monotonic: enumerate all subgraphs, then check their scores?

Enumerate subgraphs from small size to large size

Frequent Pattern Based Mining Framework

Exploratory task

Graph clustering

Graph classification

Graph index

(SIGMOD’04, ’05) (ISMB’05, ’07)

Graph Database → Frequent Patterns → Optimal Patterns

1. Bottleneck: millions, even billions, of patterns

2. No guarantee of quality

Direct Pattern Mining Framework

Exploratory task

Graph clustering

Graph classification

Graph index

Graph Database → (direct) → Optimal Patterns

How?

Upper-Bound

Upper-Bound: Anti-Monotonic (cont.)

Rule of thumb: the larger the difference between a graph pattern's frequency in the positive dataset and its frequency in the negative dataset, the more interesting the pattern.

Existing graph mining algorithms can be recycled to accommodate non-monotonic functions.
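The rule of thumb suggests one way to bound the score of every supergraph of a pattern: a supergraph can only be less frequent in both datasets, so the best case is that one frequency stays put while the other drops to (almost) zero. The sketch below assumes a G-test objective and a smoothing floor `EPS`; the names `g_test`, `upper_bound`, and `EPS` are illustrative and not taken from the slides. The bound relies on the G-test being convex in the frequency pair, so its maximum over the reachable rectangle sits at a corner.

```python
import math

EPS = 1e-3  # assumed smoothing floor; frequencies are kept in [EPS, 1 - EPS]


def _clamp(x: float) -> float:
    return min(max(x, EPS), 1.0 - EPS)


def g_test(p: float, q: float, m: int) -> float:
    """G-test score from positive frequency p, negative frequency q, m positives."""
    p, q = _clamp(p), _clamp(q)
    return 2.0 * m * (
        p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    )


def upper_bound(p: float, q: float, m: int) -> float:
    """Bound on the score of any pattern with frequencies p' <= p and q' <= q.

    Checks the two extreme corners: the positive frequency is kept while the
    negative one vanishes, or the other way around.
    """
    return max(g_test(p, EPS, m), g_test(EPS, q, m))


if __name__ == "__main__":
    # The bound depends only on (p, q), and frequency is anti-monotonic,
    # so the bound itself can only shrink as a pattern grows.
    print(upper_bound(0.5, 0.5, 100), upper_bound(0.3, 0.2, 100))
```

Because the bound never increases along a growth path, it can be plugged into a frequency-style enumeration in place of the support threshold.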

Vertical Pruning

Large <- small
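Vertical pruning can be sketched as a branch-and-bound search: grow patterns from small to large and stop extending a branch once its upper bound cannot beat the best score found so far. To keep the example short and runnable, it mines itemsets instead of graphs (set containment replaces subgraph isomorphism) and uses a G-test objective. The toy data `POS` and `NEG` and all function names are assumptions for illustration only.

```python
import math
from typing import FrozenSet, List, Tuple

EPS = 1e-3  # assumed smoothing floor


def _clamp(x: float) -> float:
    return min(max(x, EPS), 1.0 - EPS)


def g_test(p: float, q: float, m: int) -> float:
    p, q = _clamp(p), _clamp(q)
    return 2.0 * m * (
        p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    )


def freq(pattern: FrozenSet[str], db: List[FrozenSet[str]]) -> float:
    """Fraction of transactions in db that contain the pattern."""
    return sum(1 for t in db if pattern <= t) / len(db)


def best_pattern(
    pos: List[FrozenSet[str]], neg: List[FrozenSet[str]]
) -> Tuple[FrozenSet[str], float, int]:
    """Return (best pattern, its score, number of patterns scored)."""
    items = sorted(set().union(*pos, *neg))
    m = len(pos)
    best_score, best = 0.0, frozenset()
    visited = 0

    def grow(pattern: FrozenSet[str], start: int) -> None:
        nonlocal best_score, best, visited
        for i in range(start, len(items)):
            child = pattern | {items[i]}
            p, q = freq(child, pos), freq(child, neg)
            if p == 0.0 and q == 0.0:
                continue  # pattern occurs nowhere; neither do its supersets
            visited += 1
            score = g_test(p, q, m)
            if score > best_score:
                best_score, best = score, child
            # Vertical pruning: supersets have p' <= p and q' <= q.
            bound = max(g_test(p, EPS, m), g_test(EPS, q, m))
            if bound > best_score:
                grow(child, i + 1)

    grow(frozenset(), 0)
    return best, best_score, visited


POS = [frozenset("abc"), frozenset("ab"), frozenset("abd"), frozenset("bc")]
NEG = [frozenset("ac"), frozenset("cd"), frozenset("ad"), frozenset("d")]

pattern, score, visited = best_pattern(POS, NEG)
print(sorted(pattern), round(score, 3), visited)
```

The search returns the same best score as scoring every pattern, but it can skip whole branches once a strong pattern has been found early.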

Horizontal Pruning: Structural Proximity

Structural Proximity: Another Perspective

# of frequent patterns >> # of possible frequency pairs

Many patterns share the same score

Frequency Envelope

Structural Leap Search

Frequency Association

Significant patterns often fall into the high quantile of frequency

Start with the most frequent patterns

Descending Leap Mine

1. Structural Leap Search with frequency threshold

2. Support-Descending Mining, until F(g*) converges

3. Structural Leap Search
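The three steps can be sketched as a small driver around a search routine. This is only an outline of the control flow under stated assumptions: the `search(min_freq, seed_score)` callable is a hypothetical stand-in for structural leap search, halving the threshold each round is an assumed schedule, and `make_toy_search` is a toy replacement that scans a fixed list of `(name, frequency, score)` triples so the example runs on its own.

```python
from typing import Callable, List, Optional, Tuple

# search(min_freq, seed_score) -> (best pattern or None, best score found)
Search = Callable[[float, float], Tuple[Optional[str], float]]


def descending_leap_mine(
    search: Search, theta: float = 1.0, min_theta: float = 1e-3
) -> Tuple[Optional[str], float]:
    # Step 1: search with an initial (high) frequency threshold.
    best, best_score = search(theta, 0.0)

    # Step 2: lower the threshold until the best score stops improving.
    while theta > min_theta:
        theta /= 2.0  # assumed schedule
        cand, cand_score = search(theta, best_score)
        if cand is None or cand_score <= best_score:
            break  # F(g*) has converged
        best, best_score = cand, cand_score

    # Step 3: one final search without a frequency threshold,
    # seeded with the best score found so far.
    cand, cand_score = search(0.0, best_score)
    if cand is not None and cand_score > best_score:
        best, best_score = cand, cand_score
    return best, best_score


def make_toy_search(patterns: List[Tuple[str, float, float]]) -> Search:
    """Toy stand-in: linear scan instead of a real pattern search."""

    def search(min_freq: float, seed_score: float) -> Tuple[Optional[str], float]:
        best, best_score = None, seed_score
        for name, frequency, score in patterns:
            if frequency >= min_freq and score > best_score:
                best, best_score = name, score
        return best, best_score

    return search


TOY = [("a", 0.9, 1.0), ("b", 0.6, 3.0), ("c", 0.3, 5.0), ("d", 0.05, 4.0)]
result = descending_leap_mine(make_toy_search(TOY))
print(result)
```

The early rounds are cheap because a high threshold leaves few candidates, and the score they find gives the final unrestricted search a strong starting bound.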

Results: NCI Anti-Cancer Screen Datasets

Name      # of Compounds   Tumor Description
MCF-7     27,770           Breast
MOLT-4    39,765           Leukemia
NCI-H23   40,353           Non-Small Cell Lung
OVCAR-8   40,516           Ovarian
P388      41,472           Leukemia
PC-3      27,509           Prostate
SF-295    40,271           Central Nervous System
SN12C     40,004           Renal
SW-620    40,532           Colon
UACC257   39,988           Melanoma
YEAST     79,601           Yeast anti-cancer

Link: http://pubchem.ncbi.nlm.nih.gov

Chemical Compounds: anti-cancer or not

# of vertices: 10 ~ 200

Efficiency

Vertical Pruning

Horizontal Pruning

Effectiveness (runtime)

frequency descending

frequency descending + leap mine

Effectiveness (accuracy)

slightly different

Graph Classification

Name            OA Kernel   LEAP   OA Kernel (6x)   LEAP (6x)
Average (AUC)   0.70        0.72   0.75             0.77

* OA Kernel: Optimal Assignment Kernel; LEAP: LEAP search

Scalability Means Something!

[Figure: runtime of LEAP, OA, LEAP (6x), and OA (6x). Annotated runtimes: ~20 sec, ~100 sec, ~200 sec, ~8000 sec. Growth curves labeled Linear and Quadratic.]

Direct Pattern Mining Framework

Exploratory task

Graph clustering

Graph classification

Graph index

Graph Database → (direct) → Optimal Graph Patterns

Beyond Graph Patterns

Exploratory task

Clustering

Classification

Index

Itemset/Sequence/Tree Database → (direct) → Optimal Patterns

1. Direct mining can be applied to itemsets, sequences, and trees

2. Existing algorithms can be recycled to mine patterns with sophisticated measures.

3. Pattern-based methods including indexing and classification are competitive.

Thank you

Direct Mining of Discriminative and Essential Graphical and Itemset Features via Model-based Search Tree

SIGKDD’08 @ Las Vegas

Graph Classification: Kernel Approach

Kernel-based Graph Classification

Optimal Assignment Kernel (Fröhlich et al. ICML’05)
