Graph Feature Mining for Indexing and Classification A Thesis Proposal in Department of Computer Science and Engineering by Dayu Yuan The Pennsylvania State University 1 © Dayu Yuan 12/31/21

Jan 21, 2016


Graph Feature Mining for Indexing and Classification

A Thesis Proposal in Department of Computer Science and Engineering

by Dayu Yuan

The Pennsylvania State University

1 © Dayu Yuan 04/21/23

Welcome

Committee Chairs: Dr. Prasenjit Mitra, Dr. C. Lee Giles
CSE Department Faculty Members: Dr. Jessy Barlow, Dr. Daniel Kifer
Outside Member: Dr. Zan Huang

Outline
1. Motivation & Introduction
2. Graph index design for subgraph search (WebDB 2011)
3. Graph feature mining for subgraph search (in submission)
4. Future work

Motivation

Graphs are prevalent:
Chemistry: chemical molecules
Biology: protein structures
Computer-aided design
Image processing & computer vision
Social networks

[Figures: one example graph per domain; all figures come from the internet]

Goal: mine and manage graph data

Our focus

Data:
Graph database: a collection of graphs
Graph scale: hundreds of nodes and edges (chemical molecules, small social communities, mechanical parts)
Graph type: labeled, connected, undirected graphs (can be extended to other types)

[Figure: example graph database with five small labeled graphs, g1 to g5]

Graph Feature/Pattern

Features are nothing but subgraphs, subtrees, or random-walk paths.
A graph database contains an exponential number of subgraphs, so they are impossible to enumerate.
Frequent subgraphs are a popular choice.

[Figure: graph database (g1 to g5) and four subgraph features (P1 to P4)]

Graph Feature/Pattern

[Figure: graph database (g1 to g5) and four subgraph features (P1 to P4)]

Supporting sets (database graphs containing each feature):
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3

Graph Features in Subgraph Search

Subgraph search: in a graph database D = {g1, g2, ..., gn}, given a query graph q, the subgraph search algorithm returns all database graphs containing q as a subgraph.

[Figure: graph database D (g1 to g5) and query graph q1]

Graph Features in Subgraph Search

Filtering + Verification (gIndex 04): if a graph g contains the query q, then g has to contain all of q's subgraphs.

[Figure: graph database D (g1 to g5), query graph q1, and features P1 to P4]

Supporting sets:
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3
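The filtering rule can be sketched as an intersection of posting lists. The supporting sets below are copied from the slide; which features the query contains cannot be read from the transcript, so the choice of P1 and P3 is an assumption made only for illustration, and `filter_candidates` is a hypothetical helper name.

```python
# Supporting sets from the slide: feature -> IDs of database graphs containing it.
POSTINGS = {
    "P1": {"g1", "g2", "g3", "g4"},
    "P2": {"g1", "g4"},
    "P3": {"g2", "g3", "g4", "g5"},
    "P4": {"g1", "g3"},
}
DATABASE = {"g1", "g2", "g3", "g4", "g5"}


def filter_candidates(query_features, postings=POSTINGS, database=DATABASE):
    """Intersect the posting lists of every indexed feature found in the query."""
    candidates = set(database)
    for feature in query_features:
        candidates &= postings[feature]
    return candidates


# Assumption: the query contains features P1 and P3.
candidates = filter_candidates(["P1", "P3"])
print(sorted(candidates))  # ['g2', 'g3', 'g4']
```

Only the surviving candidates would then go through the subgraph isomorphism test.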

Graph Features in Supergraph Search

Supergraph search: in a graph database D = {g1, g2, ..., gn}, given a query graph q, the supergraph search algorithm returns all database graphs that have q as a supergraph, i.e., all database graphs contained in q.

[Figure: graph database D (g1 to g5) and query graph q2]

Graph Features in Supergraph Search

Filter + Verification (cIndex 07): for any subgraph feature that is not contained in the query, its supporting set can be filtered out.

[Figure: graph database D (g1 to g5), query graph q2, and features P1 to P4]

Supporting sets:
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3
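The exclusion rule for supergraph search can be sketched as follows. The supporting sets are the ones on the slide; the set of features contained in the query is an assumption (the transcript does not say which features q2 contains), and `supergraph_candidates` is a hypothetical name.

```python
POSTINGS = {
    "P1": {"g1", "g2", "g3", "g4"},
    "P2": {"g1", "g4"},
    "P3": {"g2", "g3", "g4", "g5"},
    "P4": {"g1", "g3"},
}
DATABASE = {"g1", "g2", "g3", "g4", "g5"}


def supergraph_candidates(features_in_query, postings=POSTINGS, database=DATABASE):
    """Drop every database graph that contains a feature the query lacks."""
    pruned = set()
    for feature, supporting_set in postings.items():
        if feature not in features_in_query:
            # A graph containing this feature cannot be a subgraph of the query.
            pruned |= supporting_set
    return set(database) - pruned


# Assumption: the query contains P1 and P3 but neither P2 nor P4.
print(sorted(supergraph_candidates({"P1", "P3"})))  # ['g2', 'g5']
```

Note the inverted logic compared with subgraph search: posting lists of absent features are unioned and removed, rather than posting lists of present features being intersected.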

Graph Features in Graph Classification

Graph classification applications: protein activity prediction, drug toxicity prediction, image classification
Graph kernels: the results and the rules are hard to interpret

[Figure: graph database D (g1 to g5)]

Supporting sets:
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3

Feature vectors over (P1, P2, P3, P4):
g1 = [1,1,0,1]^T
g2 = [1,0,1,0]^T
g3 = [1,0,1,1]^T
g4 = [1,1,1,0]^T
g5 = [0,0,1,0]^T
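As a small sketch, the binary feature vectors follow directly from the supporting sets: entry i of a graph's vector is 1 exactly when the graph appears in the supporting set of feature Pi. The helper name `feature_vector` is illustrative only.

```python
FEATURES = ["P1", "P2", "P3", "P4"]
POSTINGS = {
    "P1": {"g1", "g2", "g3", "g4"},
    "P2": {"g1", "g4"},
    "P3": {"g2", "g3", "g4", "g5"},
    "P4": {"g1", "g3"},
}


def feature_vector(graph_id):
    """Return the 0/1 vector of the graph over the ordered feature list."""
    return [1 if graph_id in POSTINGS[f] else 0 for f in FEATURES]


vectors = {g: feature_vector(g) for g in ["g1", "g2", "g3", "g4", "g5"]}
for graph_id, vec in vectors.items():
    print(graph_id, vec)
```

The resulting vectors can be fed to any standard vector-based classifier, which is what makes the learned rules easier to interpret than a graph kernel.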

Graph Feature Mining

Motivation:
Graph query operations (subgraph/supergraph search) need features to build the index.
Graph learning needs features to explicitly represent graphs as vectors.

Challenges:
Indexing: limited memory
Classification: curse of dimensionality
Exponential number of subgraphs
Too many frequent subgraphs, most of which are redundant and not discriminative/informative

Research in Graph Feature Mining

1. Mine frequent subgraphs:
(1) Not computationally efficient
(2) Curse of dimensionality
(3) Most frequent subgraphs are redundant or not discriminative

Research in Graph Feature Mining

2. Batch-mode discriminative & frequent subgraph mining:
(A) First enumerate all frequent subgraphs [bottleneck]
(B) Mine discriminative & frequent subgraphs out of the frequent subgraphs

Challenge: how to set the minimum support in step (A)? Small? Big?

Research in Graph Feature Mining

3. Direct feature mining:
(A) Search for a feature f optimizing an objective function
(B) Find K features: run the above algorithm iteratively (forward feature selection)

Proposal

Our initial study on subgraph search:
Graph index structure design
Graph feature mining

Future work plan:
Supergraph search: graph index structure design (initial study), graph feature mining
Graph classification: graph descriptor mining for classification, irredundant iterative feature mining

Outline
1. Motivation & Introduction
2. Graph index design for subgraph search (WebDB 2011)
   Background of subgraph search: preliminary & problem definition; Filter + Verification [feature-based index approach]
3. Graph feature mining for subgraph search (in submission)
   Direct feature mining for subgraph search
4. Future work

Subgraph Search: Definition

Problem definition: in a graph database D = {g1, g2, ..., gn}, given a query graph q, the subgraph search algorithm returns all database graphs containing q as a subgraph.

Solutions:
Brute force: for each query q, scan the whole dataset to find D(q); the candidate set is the entire database, C(q) = D.
Filter + Verification: given a query q, find a candidate set C(q), then verify each graph in C(q) to obtain D(q).

[Figure: nested sets, with D(q) inside C(q) inside D]

Subgraph Search: Solutions

Filter + Verification rule: if a graph g contains the query q, then g has to contain all of q's subgraphs.

Inverted index of <Key, Value> pairs:
Key: subgraph features (small subgraph segments)
Value: posting list (IDs of all database graphs containing the "key" subgraph)

Subgraph Search: Related Work

Total query processing time:
(1) Filtering cost, D to C(q): the cost of searching for the subgraph features contained in the query, plus the cost of loading and intersecting the posting lists
(2) Verification cost, C(q) to D(q): subgraph isomorphism tests, which are NP-complete and dominate the overall cost

Related work: reduce the verification cost by mining subgraph features.
Disadvantages:
(1) "Batch mode" feature mining
(2) A different index structure is designed for each kind of feature

Outline
1. Motivation & Introduction
2. Graph index design for subgraph search (WebDB 2011)
   Background of subgraph search
   Lindex: a general index structure for subgraph search
   Effective (filtering power), efficient (response time), compact (memory consumption)
   Experiment results
3. Graph feature mining for subgraph search (in submission)
   Direct feature mining for subgraph search
4. Future work

Lindex: A General Index Structure

Contributions:
Orthogonal to related work (feature mining)
Applicable to all subgraph/subtree features: Lindex decouples feature mining from index structure design
Compact, effective, and efficient

Lattice-like index:
(1) Organize the indexing features in a lattice
(2) Partition the value set (supporting set / posting list)

Lindex: Effective in Filtering

Definition (maxSub, minSup). Let S be the set of all indexing features.

$$\mathrm{maxSub}(g, S) = \{\, g_i \in S \mid g_i \subset g,\ \neg\exists x \in S \text{ s.t. } g_i \subset x \subset g \,\}$$
$$\mathrm{minSup}(g, S) = \{\, g_i \in S \mid g \subset g_i,\ \neg\exists x \in S \text{ s.t. } g \subset x \subset g_i \,\}$$

Example:
(1) sg2 and sg4 are the maxSub of q
(2) sg5 is the minSup of q

Lindex: Effective in Filtering

Strategy One: Minimal Supergraph Filtering
Given a query q and a Lindex L(D, S), the candidate set on which an algorithm should check for subgraph isomorphism is

$$C(q) = \bigcap_i D(f_i) - \bigcup_j D(h_j), \quad \forall f_i \in \mathrm{maxSub}(q),\ \forall h_j \in \mathrm{minSup}(q) \qquad (3)$$

Example:
(1) sg2 and sg4 are the maxSub of q
(2) sg5 is the minSup of q

$$C(q) = D(sg_2) \cap D(sg_4) - D(sg_5) = \{a,b,c\} \cap \{a,b,d\} - \{b\} = \{a,b\} - \{b\} = \{a\}$$

Intersecting only the maxSub features loses nothing:
$$D(sg_0) \cap D(sg_1) \cap D(sg_2) \cap D(sg_3) \cap D(sg_4) = D(sg_2) \cap D(sg_4)$$
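The worked example can be replayed with plain sets. In this sketch the supporting sets of sg2, sg4, and sg5 come from the slide; the containment relation among the features and the supporting sets of sg0, sg1, and sg3 are assumptions chosen to be consistent with it. Graphs in D(h) for a minSup feature h contain a supergraph of q and hence q itself, so they need no isomorphism test.

```python
# SUPERS[f] = indexed features that strictly contain f (assumed lattice, transitively closed).
SUPERS = {
    "sg0": {"sg1", "sg2", "sg3", "sg4", "sg5"},
    "sg1": {"sg2", "sg5"},
    "sg2": {"sg5"},
    "sg3": {"sg4", "sg5"},
    "sg4": {"sg5"},
    "sg5": set(),
}
# Supporting sets; sg2, sg4, sg5 are from the slide, the rest are assumed.
D = {
    "sg0": {"a", "b", "c", "d"}, "sg1": {"a", "b", "c", "d"}, "sg2": {"a", "b", "c"},
    "sg3": {"a", "b", "d"}, "sg4": {"a", "b", "d"}, "sg5": {"b"},
}


def max_sub(subs):
    """Features contained in q that have no strict superfeature also contained in q."""
    return {f for f in subs if not (SUPERS[f] & subs)}


def min_sup(sups):
    """Features containing q that have no strict subfeature also containing q."""
    return {h for h in sups if not any(h in SUPERS[x] for x in sups)}


def candidate_set(subs, sups):
    cand = set.intersection(*(D[f] for f in max_sub(subs)))
    for h in min_sup(sups):
        cand -= D[h]  # these graphs are answers already, no verification needed
    return cand


subs_of_q = {"sg0", "sg1", "sg2", "sg3", "sg4"}
sups_of_q = {"sg5"}
print(sorted(max_sub(subs_of_q)), candidate_set(subs_of_q, sups_of_q))
```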

Lindex: Effective in Filtering

Strategy Two: Postings Partition
The value set of each feature is split into a direct and an indirect value set.

Direct set:
$$V_d(sg) = \{\, g \in D(sg) \mid sg \text{ can be extended to } g \text{ without being isomorphic to any other feature} \,\}$$
Indirect set:
$$V_i(sg) = D(sg) - V_d(sg)$$

[Figure: database graphs and the index]

Question: why is "b" in the direct value set of "sg1", but "a" is not?

Lindex: Effective in Filtering

Given a query q and a Lindex L(D, S), the candidate set on which an algorithm should check for subgraph isomorphism is

$$C(q) = \bigcap_i V_d(f_i) - \bigcup_j D(h_j), \quad \forall f_i \in \mathrm{maxSub}(q),\ \forall h_j \in \mathrm{minSup}(q)$$

Graphs that need to be verified for query "a":
Traditional model: {a,b,c} ∩ {a,b,c} = {a,b,c}
Strategy (1): {a,b,c} ∩ {a,b,c} − {c} = {a,b}
Strategy (1 + 2): {a,c} ∩ {a} − {c} = {a}

Lindex: Compact

Space saving (extension labeling). Each edge in a graph is represented as:
<ID(u), ID(v), Label(u), Label(edge(u, v)), Label(v)>

The label of the graph sg2 is <1,2,6,1,7>, <1,3,6,2,6>.
The label of its chosen parent sg1 is <1,2,6,1,7>.
Then sg2 can be stored as just <1,3,6,2,6>.
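Extension labeling can be sketched as storing, for each lattice node, only the edges it adds to its chosen parent. The edge tuples below are the ones on the slide; the function names are hypothetical, and the sketch assumes the parent's node IDs are preserved in the child.

```python
# Each edge is (ID(u), ID(v), Label(u), Label(edge), Label(v)).
SG1_LABEL = [(1, 2, 6, 1, 7)]
SG2_LABEL = [(1, 2, 6, 1, 7), (1, 3, 6, 2, 6)]


def encode_extension(child_label, parent_label):
    """Keep only the edges of the child that the parent does not already store."""
    parent_edges = set(parent_label)
    return [edge for edge in child_label if edge not in parent_edges]


def decode_extension(parent_label, extension):
    """Rebuild the child's full label from the parent's label plus the stored extension."""
    return list(parent_label) + list(extension)


stored = encode_extension(SG2_LABEL, SG1_LABEL)
print(stored)  # [(1, 3, 6, 2, 6)]
```

With this encoding the storage cost of a feature grows with the size of its difference from the parent rather than with the size of the feature itself.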

Lindex: Empirical Evaluation of Memory

Index sizes in KB, by index (rows) and feature set (columns):

Index \ Feature: DFG, ∆TCFG, MimR, Tree+∆, DFT
Feature Count: 7599/6238 9873/5712 50006172/3 87500/61 72
Gindex: 1359, 1534, 1348, 1339
FGindex: 1826
SwiftIndex: 860
Lindex: 677, 841, 772, 676, 671

Lindex: Efficient in maxSub Feature Search

The label of the graph sg2 is <1,2,6,1,7>, <1,3,6,2,6>.
The label of its chosen parent sg1 is <1,2,6,1,7>.
Node 1 of sg1 is mapped to node 1 of sg2.

Lindex does not construct canonical labels for each subgraph of q and compare them with the existing labels in the index to check whether an indexing feature matches. Instead, while traversing the graph lattice, the mappings constructed to check that a graph sg1 is contained in q are extended to check whether sg2, a supergraph of sg1 in the lattice, is contained in q, by incrementally expanding the mappings from sg1 to q.

Lindex: Efficient in minSup Feature Search

The set of minimal supergraphs of a query q in the Lindex is a subset of the intersection of the sets of descendants of each subgraph node of q in the partial lattice.

Lindex: Experiments

Experiments on the AIDS dataset: 40,000 graphs

[Figures: experimental results, three slides]

Outline
1. Motivation & Introduction
2. Graph index design for subgraph search (WebDB 2011)
3. Graph feature mining for subgraph search (in submission)
   Direct feature mining for subgraph search
   Problem definition & objective function
   Branch & bound
   Heuristic-based search space exploration: partition of the search space
   Experiment results
4. Future work

Feature Mining: Motivation

All previous feature selection algorithms for the subgraph search problem follow "batch mode":
They assume a stable database
Bottleneck: frequent subgraph enumeration
Parameters (minimum support, etc.) are hard to tune

Our contributions:
First direct feature mining algorithm for the subgraph search problem
Effective in index updating
Chooses high-quality features

Feature Mining: Problem Definition

Iterative index updating: given a database D and a current index I with features P0,
(1) Remove the least useful feature
(2) Add a new feature
(3) Go to (1) until convergence

[Figure: graph database D (g1 to g5), query graph q1, and the current feature set P0 = {p1, p2, p3, p4}]

Supporting sets:
P1: g1, g2, g3, g4
P2: g2, g3, g4, g5
P3: g1, g4
P4: g2, g3

Feature Mining: Problem Definition

Previous work: given a graph database D, find a set of subgraph (subtree) features minimizing the response time over training queries Q.

$$T_{resp}(q) = T_{filter}(q) + T_{verf}(q, C(q))$$
$$T_{resp}(q) \approx T_{verf}(q, C(q)) \propto \begin{cases} 0 & q \in P \\ |C(q)| = \left| \bigcap_{p_i \in \mathrm{maxSub}(q)} D(p_i) \right| & q \notin P \end{cases}$$
$$P = \arg\min_{|P| = N} \sum_{q \in Q} |C(q, P)|$$

Our work: given a graph database D and an already built index I with feature set P0, search for a new feature p such that the new feature set {P0 + p} minimizes the response time.

$$\mathrm{gain}(p, P_0) = \sum_{q \in Q} |C(q, P_0)| - \sum_{q \in Q} |C(q, \{p, P_0\})|$$
$$p = \arg\max_p\ \mathrm{gain}(p, P_0)$$

Feature Mining: Problem Definition

Iterative index updating: given a database D and a current index I with features P0,

(1) Remove the least useful feature: find a feature p in P0
$$p = \arg\min_{p \in P_0} \left( \sum_{q \in Q} |C(q, P_0 \setminus p)| - \sum_{q \in Q} |C(q, P_0)| \right), \qquad P_0 = P_0 - p$$

(2) Add a new feature: find a new feature p
$$p = \arg\max_p \left( \sum_{q \in Q} |C(q, P_0)| - \sum_{q \in Q} |C(q, \{p, P_0\})| \right), \qquad P_0 = P_0 + p$$

(3) Go to (1)

The candidate set is
$$C(q, P) = \bigcap_{i:\ X_q[i] = 1} D(p_i) = \bigcap_{p_i \in \mathrm{maxSub}(q, P)} D(p_i)$$
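The two selection rules can be sketched on a toy instance. Everything below is an illustration under stated assumptions: the supporting sets reuse the P1 to P4 example from the earlier slides, the training queries are hypothetical and described only by the features they contain, new features are drawn from a fixed pool rather than mined, and the special case q ∈ P (cost 0) is ignored.

```python
DATABASE = {"g1", "g2", "g3", "g4", "g5"}
SUPPORT = {
    "P1": {"g1", "g2", "g3", "g4"},
    "P2": {"g1", "g4"},
    "P3": {"g2", "g3", "g4", "g5"},
    "P4": {"g1", "g3"},
}
# Hypothetical training queries, each given as the set of features it contains.
QUERIES = [{"P1", "P3"}, {"P1", "P2"}, {"P3", "P4"}, {"P1"}]


def candidate_set(query, index):
    """C(q, P): intersect D(p) over all indexed features contained in the query."""
    cand = set(DATABASE)
    for p in query & index:
        cand &= SUPPORT[p]
    return cand


def total_cost(index):
    return sum(len(candidate_set(q, index)) for q in QUERIES)


def gain(p, index):
    """Decrease in total candidate set size when feature p joins the index."""
    return total_cost(index) - total_cost(index | {p})


def best_addition(index):
    pool = sorted(set(SUPPORT) - index)
    return max(pool, key=lambda p: gain(p, index))


def least_useful(index):
    """Feature whose removal increases the total candidate set size the least."""
    return min(sorted(index), key=lambda p: total_cost(index - {p}) - total_cost(index))


P0 = {"P1", "P3"}
print(best_addition(P0), gain("P4", P0))  # P4 3
```

Intersecting over all contained features gives the same set as intersecting over maxSub only, because the supporting set of a feature is contained in the supporting set of each of its subgraphs.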

Feature Mining: More on the Objective Function

(1) Pros and cons of using the query logs: the objective functions of previous algorithms (e.g., Gindex, FGindex) depend on queries too [implicitly].
(2) The selected features are "discriminative":
Previous work: the discriminative power of sg is measured w.r.t. sub(sg) or sup(sg), where sub(sg) denotes all subgraphs of sg and sup(sg) denotes all supergraphs of sg.
Our objective function: discriminative power is measured w.r.t. P0.
(3) Computation issues (next slides)

Feature Mining: More on the Objective Function (cont.)

$$\mathrm{gain}(p, P_0) = \sum_{q \in Q} |C(q, P_0)| - \sum_{q \in Q} |C(q, \{p, P_0\})|$$

Within Q, only two groups of queries matter: the minSup queries of p, {q ∈ Q | p ∈ maxSub(q, {p, P0})}, and the queries identical to p, {q ∈ Q | q = p}.

$$\mathrm{gain}(p, P_0) = \sum_{q \in \mathrm{minSup}(p, Q)} \left( |C(q, P_0)| - |C(q, \{p, P_0\})| \right) + \sum_{q \in Q} I(p = q)\, |C(q, P_0)|$$
$$\mathrm{gain}(p, P_0) = \sum_{q \in \mathrm{minSup}(p, Q)} \left( |C(q, P_0)| - |C(q, P_0) \cap D(p)| \right) + \sum_{q \in Q} I(p = q)\, |C(q, P_0)|$$

Computing D(p) for each enumerated feature p is expensive.

Definitions:
$$\mathrm{maxSub}(q, S) = \{\, g_i \in S \mid g_i \subset q,\ \neg\exists x \in S \text{ s.t. } g_i \subset x \subset q \,\}$$
$$\mathrm{minSup}(p, Q) = \{\, q \in Q \mid p \in \mathrm{maxSub}(q, S) \,\}$$
$$C(q, P) = \bigcap_{i:\ X_q[i] = 1} D(p_i) = \bigcap_{p_i \in \mathrm{maxSub}(q, P)} D(p_i)$$

Feature Mining: Estimate the Objective Function

The objective function of a new subgraph feature p has an upper bound and a lower bound that are inexpensive to compute (proof omitted):

$$\mathrm{Upp}(p, P_0) = \frac{1}{|Q|} \sum_{q \in \mathrm{minSup}(p, Q)} |C(q, P_0) - D(q)| + \frac{1}{|Q|} \sum_{q \in Q} I(p = q)\, |C(q, P_0)|$$
$$\mathrm{Low}(p, P_0) = \frac{1}{|Q|} \sum_{q \in \mathrm{minSup}(p, Q)} \left| C(q, P_0) - \tfrac{1}{\gamma} D(\mathrm{maxSub}(p)) \right| + \frac{1}{|Q|} \sum_{q \in Q} I(p = q)\, |C(q, P_0)|$$

Two approaches to estimation:
(1) Lazy calculation: gain(p, P0) does not have to be calculated when
Upp(p, P0) < gain(p*, P0)
Low(p, P0) > gain(p*, P0)
(2) Interpolation between the bounds:
$$\mathrm{gain}(p, P_0) \approx \alpha \times \mathrm{Upp}(p, P_0) + (1 - \alpha) \times \mathrm{Low}(p, P_0)$$
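The lazy-calculation idea can be sketched with the upper bound alone: candidates are visited in decreasing order of their bound, and the exact gain is skipped once the bound falls below the best gain found so far. The bound and gain values are made-up numbers, and `lazy_argmax` is a hypothetical helper, not the proposal's algorithm.

```python
def lazy_argmax(candidates, upper_bound, exact_gain):
    """Return (best feature, its gain, number of exact gain evaluations)."""
    best, best_gain, evaluated = None, float("-inf"), 0
    for p in sorted(candidates, key=upper_bound, reverse=True):
        if upper_bound(p) < best_gain:
            break  # all remaining candidates have an even smaller upper bound
        g = exact_gain(p)
        evaluated += 1
        if g > best_gain:
            best, best_gain = p, g
    return best, best_gain, evaluated


# Made-up values with GAIN[p] <= UPP[p] for every candidate.
UPP = {"a": 5.0, "b": 4.0, "c": 2.0, "d": 1.0}
GAIN = {"a": 3.0, "b": 3.5, "c": 1.5, "d": 0.5}

result = lazy_argmax(UPP, UPP.get, GAIN.get)
print(result)  # ('b', 3.5, 2)
```

In this run only two of the four exact gains are computed; the expensive D(p) computation is avoided for the rest.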

Feature Mining: Challenges

(1) Exponential search space for the new index subgraph feature p.
(2) The objective function is neither monotonic nor anti-monotonic. [The Apriori rule cannot be used.]
(3) Traditional heuristic-based graph feature mining algorithms (e.g., LeapSearch) do not work. (They rely only on "frequencies".)

Feature Mining: Branch and Bound

Exhaustive search according to the DFS tree: a graph (pattern) can be canonically labeled as a string, and the DFS tree is a prefix tree of the labels of graphs.

[Figure: DFS tree with nodes n1 to n7]

The objective function is neither monotonic nor anti-monotonic. Instead, for each branch, e.g., the branch starting from n5, find a branch upper bound that is no smaller than the gain value of every node on that branch.

Feature Mining: Branch and Bound

Theorem: for a feature p, an upper bound exists such that for all p' that are supergraphs of p, gain(p', P0) <= BUpp(p, P0). (Proof omitted.)

$$\mathrm{BUpp}(p) = \frac{1}{|Q|} \left\{ \sum_{q \in Q,\ q \supset p} |C(q, P_0) - D(q)| + \max_{p' \supset p} \sum_{q \in Q} |C(p')|\, I(q = p') \right\}$$

[Figure: DFS tree with nodes n1 to n7]
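Pruning with a branch upper bound can be sketched on a seven-node tree like the one in the figure. The tree shape, gains, and bounds below are all hypothetical; the only property the sketch relies on is the one stated in the theorem, namely that the bound of a node is at least the gain of every descendant.

```python
# Hypothetical DFS tree, gains, and branch upper bounds (bound >= gain of all descendants).
CHILDREN = {"n1": ["n2", "n5"], "n2": ["n3", "n4"], "n5": ["n6", "n7"],
            "n3": [], "n4": [], "n6": [], "n7": []}
GAIN = {"n1": 1, "n2": 4, "n3": 6, "n4": 2, "n5": 3, "n6": 5, "n7": 1}
BRANCH_UPPER = {"n1": 9, "n2": 7, "n3": 0, "n4": 0, "n5": 5, "n6": 0, "n7": 0}


def branch_and_bound(root):
    """Depth-first search that skips a subtree when its bound cannot beat the best gain."""
    best, best_gain, visited = None, float("-inf"), []
    stack = [root]
    while stack:
        node = stack.pop()
        visited.append(node)
        if GAIN[node] > best_gain:
            best, best_gain = node, GAIN[node]
        if BRANCH_UPPER[node] <= best_gain:
            continue  # no descendant can improve on the current best
        stack.extend(reversed(CHILDREN[node]))  # keep left-to-right visiting order
    return best, best_gain, visited


best, best_gain, visited = branch_and_bound("n1")
print(best, best_gain, visited)
```

Here the branch under n5 is never expanded: its bound of 5 cannot beat the gain of 6 already found at n3.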

Feature Mining: Heuristic-Based Search Space Partition

Problem: the search always starts from the same root and proceeds in the same order.

Observation: the new graph pattern p must be a supergraph of some pattern in P0, i.e., p ⊃ p2 in Figure 4. A pattern r in P0 is a promising root when:
1) A great proportion of the queries are supergraphs of r; otherwise few queries will use p ⊃ r for filtering.
2) The average size of the candidate sets for queries ⊃ r is large, which means improvement over those queries is important.

$$\mathrm{sPoint}(r) = \sum_{q \in \mathrm{minSup}(r, Q)} |C(q, P_0) - D(q)| + \max_{p' \supset r} \sum_{q \in \mathrm{minSup}(r, Q)} |C(p')|\, I(q = p')$$

Feature Mining: Heuristic-Based Search Space Partition

Procedure:
(1) gain(p*) = 0
(2) Sort all features in P0 by sPoint(pi) in decreasing order
(3) Start iterating:
    For i = 1 to |P0| do
        If BUpp(ri) < gain(p*) then break
        Else
            Find the minimal supergraph queries minSup(ri, Q)
            p*(ri) = Branch & Bound Search(minSup(ri, Q), p*)
            If gain(p*(ri)) > gain(p*), update p* = p*(ri)

Discussion:
(1) Candidate features are enumerated as descendants of the "root" (partition of the search space).
(2) Candidate features are "frequent" on D(r), not on all of D, which allows a smaller minimum support.
(3) The roots are visited in order of their sPoint(r) scores, so a close-to-optimal feature is found quickly.
(4) Top-k feature selection

Feature Mining: Experiment

Dataset: the AIDS dataset D (40K chemical molecules)
Index0: Gindex with minimum support 0.05
IndexDF: Gindex with minimum support 0.02 [1175 new features are added]
Index QG/BB/TK: index updated based on Index0
BB: branch and bound
QG: search space partitioned
TK: top-k features returned in one iteration

[Figure: comparison when achieving the same decrease in candidate set size]

[Figures: experimental results, two slides]

Feature Mining: Experiment

Two datasets: D1 and D2 (80% the same)
DF(D1): Gindex on dataset D1
DF(D2): Gindex on dataset D2
Index QG/BB/TK: index updated based on DF(D1)
BB: branch and bound
QG: search space partitioned
TK: top-k features returned in one iteration

Exp1: D2 = D1 + 20% new
Exp2: D2 = 80% D1 + 20% new
Iterate until the objective value is stable

Feature Mining: Experiments

[Figure: DF vs. iterative methods]
[Figure: experimental results]
[Figure: TCFG vs. iterative methods]
[Figure: MimR vs. iterative methods]

Outline
1. Motivation & Introduction
2. Graph index design for subgraph search (WebDB 2011)
3. Graph feature mining for subgraph search (in submission)
4. Future work
   Supergraph search: index structure; feature selection for supergraph search
   Graph feature mining for classification
   Timeline

Future work: supergraph search

Problem definition: in a graph database D = {g1, g2, ..., gn}, given a query graph q, the supergraph search algorithm returns all database graphs that have q as a supergraph.

Exclusive logic filtering: for any subgraph feature that is not contained in the query, its supporting set can be filtered out.

In-memory model: too many disk operations if the postings are stored on disk.

Future work: supergraph search
On-Disk Model & Feature Selection

Features are organized in a lattice (same as Lindex).
Each feature is associated with one value set, and the value sets are disjoint:
$$\mathrm{valueSet}(f) \cap \mathrm{valueSet}(p) = \emptyset$$
The value set of feature f:
$$\mathrm{valueSet}(f) \subseteq D(f) - \bigcup_{p \supset f} D(p)$$
Query processing model:
$$C(q) = \bigcup_{f \in F(q)} \mathrm{valueSet}(f) \cup \mathrm{valueSet}(\phi)$$
$$|C(q)| = \sum_{f \in F(q)} |\mathrm{valueSet}(f)| + |\mathrm{valueSet}(\phi)|$$

[Figure: lattice of features Sg1, Sg2, Sg3, database graphs a, b, c, and query q]

Two value set assignments for the same features:
Assignment 1: Sg1: [a], Sg2: [c], Sg3: [b]
Assignment 2: Sg1: [], Sg2: [a, c], Sg3: [b]
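The query processing model can be sketched with the two value set assignments from the slides. The slide does not define F(q); the sketch assumes it is the set of index features contained in the query, takes F(q) = {Sg1, Sg3} purely for illustration, and assumes no graph falls outside every value set (an empty valueSet(φ)).

```python
ASSIGNMENT_1 = {"Sg1": {"a"}, "Sg2": {"c"}, "Sg3": {"b"}}
ASSIGNMENT_2 = {"Sg1": set(), "Sg2": {"a", "c"}, "Sg3": {"b"}}


def is_disjoint(value_sets):
    """Check that no database graph is stored under two features."""
    total = sum(len(v) for v in value_sets.values())
    return total == len(set().union(*value_sets.values()))


def candidates(features_in_query, value_sets, uncovered=frozenset()):
    """C(q): union of the value sets of the features in F(q), plus valueSet(phi)."""
    result = set(uncovered)
    for f in features_in_query:
        result |= value_sets[f]
    return result


F_Q = {"Sg1", "Sg3"}  # assumed F(q)
print(sorted(candidates(F_Q, ASSIGNMENT_1)))  # ['a', 'b']
print(sorted(candidates(F_Q, ASSIGNMENT_2)))  # ['b']
```

Because the value sets are disjoint, each graph is read from disk at most once and the candidate count is a plain sum of value set sizes; the example also shows that the assignment itself changes how many graphs must be verified.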

Future work: supergraph search
On-Disk Model & Feature Selection

Tentative solution to this problem:
(1) Banking float maximization problem (similar to the max cover problem): NP-hard; a polynomial-time algorithm with approximation ratio (1 − 1/e) exists.
(2) Scalability issues:
(a) Solve the problem with MapReduce (a MapReduce solution exists for the max cover problem [63, 65])
(b) Direct feature mining
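For reference, the standard greedy algorithm for maximum coverage, which achieves the (1 − 1/e) ratio, is sketched below on made-up sets. The slide only states that the feature selection problem is similar to max cover, so this is the textbook routine rather than the proposal's own method.

```python
def greedy_max_cover(sets, k):
    """Pick up to k sets, each time taking the one that covers the most new elements."""
    covered, chosen = set(), []
    for _ in range(k):
        name = max(sorted(sets), key=lambda s: len(sets[s] - covered))
        if not sets[name] - covered:
            break  # nothing new can be covered
        chosen.append(name)
        covered |= sets[name]
    return chosen, covered


# Made-up instance for illustration.
SETS = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1, 6}}
chosen, covered = greedy_max_cover(SETS, k=2)
print(chosen, sorted(covered))  # ['A', 'C'] [1, 2, 3, 4, 5, 6]
```

Each round scans every set once, which is the step a MapReduce formulation would distribute.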

Future work: graph classification

Feature vector for graph data: given a set of n graph features p1, p2, ..., pn, a graph g can be coded as a vector Xg = [x1, x2, ..., xn]^T, where each xi is a {0, 1}-valued variable and xi = 1 if and only if pi ⊆ g.

Graph vectors based on features p1, p2, p3, and p4:
g1 = [1,1,1,0]^T
g2 = [1,0,1,1]^T
g3 = [0,1,1,0]^T
q = [1,0,1,0]^T

Compared with graph kernels: interpretability

Future work: graph classification

Previous work:
(1) Class two algorithms [batch mode]
(2) Class three algorithms [direct mining]
Iterative mining: at each iteration, only one feature is selected.
Redundancy: a feature f with a very low objective function value in iteration i may be enumerated again in the following iterations.
(3) Heuristic-based feature mining: the search space of subgraph features is explored randomly. Close-to-optimal features tend to be discovered quickly (much faster than exact search).

Why not use other descriptors instead of subgraphs as features?

Future work: graph classification
Direction One: Descriptor Mining

Tentative solution: use information diffusion models to build a descriptor that captures the local context and topology information.

Future work: graph classification
Direction One: Descriptor Mining

What is information diffusion? (the word-of-mouth effect)

Different models:
General assumption: nodes are either active or inactive; active nodes may cause other nodes to activate; active nodes never deactivate.
Linear Threshold Model [69]
Independent Cascade Model [70]

Build a descriptor:
(1) For each node v, activate only v and trigger the information diffusion procedure.
(2) Collect all active nodes to build the descriptor.

Information Diffusion Models

Linear Threshold Model [69]
A node v is influenced by each neighbor w with weight b(v, w).
Node v will be activated if $O(v) \geq \delta_v$, where $\delta_v$ is a random variable in [0, 1].

Independent Cascade Model [70]
When node v is activated, it has one and only one chance of activating each of its inactive neighbors w. The activation attempt succeeds with probability $p_{vw}$.
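Both models can be simulated in a few lines. The sketch below is illustrative only and rests on assumptions the slides do not spell out: O(v) is read as the summed weight b(v, w) over the currently active neighbors w of v, and δv is drawn uniformly at random. The function names and the toy path graph are hypothetical.

```python
import random
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Node = Hashable
Adjacency = Dict[Node, List[Node]]
EdgeValues = Dict[Tuple[Node, Node], float]


def independent_cascade(neighbors: Adjacency, prob: EdgeValues,
                        seeds: Iterable[Node], rng: random.Random) -> Set[Node]:
    """Each newly activated node v gets exactly one attempt per inactive
    neighbor w, succeeding with probability prob[(v, w)]."""
    frontier = list(seeds)
    active = set(frontier)
    while frontier:
        newly_active = []
        for v in frontier:
            for w in neighbors.get(v, []):
                if w not in active and rng.random() < prob.get((v, w), 0.0):
                    active.add(w)
                    newly_active.append(w)
        frontier = newly_active  # only fresh nodes get to try next round
    return active


def linear_threshold(neighbors: Adjacency, weight: EdgeValues,
                     seeds: Iterable[Node], rng: random.Random) -> Set[Node]:
    """Node v activates once the summed weight b(v, w) of its active neighbors
    reaches its threshold delta_v (assumed reading of O(v) >= delta_v)."""
    nodes = set(neighbors) | {w for ws in neighbors.values() for w in ws}
    # Assumption: thresholds are uniform random draws, fixed for the whole run.
    delta = {v: rng.random() for v in sorted(nodes, key=repr)}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in nodes - active:
            influence = sum(weight.get((v, w), 0.0)
                            for w in neighbors.get(v, []) if w in active)
            if influence >= delta[v]:
                active.add(v)
                changed = True
    return active


# Hypothetical path graph a - b - c with every weight and probability set to 1.0.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
certain = {(u, w): 1.0 for u in path for w in path[u]}
ic_active = independent_cascade(path, certain, ["a"], random.Random(0))
lt_active = linear_threshold(path, certain, ["a"], random.Random(0))
```

With all values set to 1.0 the diffusion is deterministic and reaches every node from `a`; smaller values make the outcome depend on the random draws.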

Future work: graph classification

Build a descriptor:
(1) For each node v, activate only v and trigger the information diffusion.
(2) Collect all active nodes to build the descriptor.
(3) Des(v, l) is defined for node v and label l; p1 denotes the probability that a node with label 'a' becomes active.

DesA      Probability
Label a   p1 = 1
Label b   p2 = 0.8
Label c   p3 = 0.9
Label d   p4 = 0.4
...

DesB      Probability
Label a   p1 = 0.5
Label b   p2 = 0.4
Label c   p3 = 0.3
Label d   p4 = 0.7
...
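One way to obtain such a descriptor is Monte Carlo estimation. The sketch below is an assumption-laden illustration, not the proposal's method: it uses the Independent Cascade model, and it reads Des(v, l) as the probability that at least one node with label l becomes active when the diffusion starts from v alone. The names `cascade_from` and `build_descriptor`, the toy graph, and the edge probabilities are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Set, Tuple

Node = Hashable


def cascade_from(seed_node: Node, neighbors: Dict[Node, List[Node]],
                 prob: Dict[Tuple[Node, Node], float],
                 rng: random.Random) -> Set[Node]:
    """One Independent Cascade run that starts with only seed_node active."""
    active = {seed_node}
    frontier = [seed_node]
    while frontier:
        newly_active = []
        for v in frontier:
            for w in neighbors.get(v, []):
                if w not in active and rng.random() < prob.get((v, w), 0.0):
                    active.add(w)
                    newly_active.append(w)
        frontier = newly_active
    return active


def build_descriptor(v: Node, neighbors: Dict[Node, List[Node]],
                     labels: Dict[Node, str],
                     prob: Dict[Tuple[Node, Node], float],
                     runs: int = 1000, seed: int = 0) -> Dict[str, float]:
    """Estimate Des(v, l) for every label l by repeated simulation."""
    rng = random.Random(seed)
    hits: Counter = Counter()
    for _ in range(runs):
        active = cascade_from(v, neighbors, prob, rng)
        # Count each label at most once per run (assumed "at least one node").
        hits.update({labels[u] for u in active})
    return {label: hits[label] / runs for label in sorted(set(labels.values()))}


# Hypothetical labeled path 1(a) - 2(b) - 3(c).
neighbors = {1: [2], 2: [1, 3], 3: [2]}
labels = {1: "a", 2: "b", 3: "c"}
prob = {(1, 2): 0.5, (2, 1): 0.5, (2, 3): 0.0, (3, 2): 0.0}
des_1 = build_descriptor(1, neighbors, labels, prob)
```

The label of the start node always has probability 1, matching the `Label a p1 = 1` row of DesA; the other entries shrink as the labels move further from v or sit behind weaker edges.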

Future work: graph classification

Mining descriptors for graph classification: convert to the probabilistic item-set mining problem.

Future plan:
(1) Try different information diffusion models to compare their pros and cons in modeling the 'local' context and topology information.
(2) Explore discriminative probabilistic item-set mining algorithms for (binary) classification.
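To make the conversion concrete, each descriptor can be treated as an uncertain transaction whose items are labels with existence probabilities. The sketch below computes the expected support of a label set, one common formulation in mining uncertain data. It assumes the label probabilities within a descriptor are independent, which the proposal does not state; the function name `expected_support` is hypothetical, and the values come from the DesA and DesB tables.

```python
from math import prod
from typing import Dict, Iterable, List


def expected_support(itemset: Iterable[str],
                     transactions: List[Dict[str, float]]) -> float:
    """Sum over transactions of the probability that all items are present.

    Assumes item probabilities within a transaction are independent.
    """
    items = list(itemset)
    return sum(prod(t.get(item, 0.0) for item in items) for t in transactions)


# Label probabilities from the DesA and DesB tables.
des_a = {"a": 1.0, "b": 0.8, "c": 0.9, "d": 0.4}
des_b = {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.7}

# 1.0 * 0.9 + 0.5 * 0.3, about 1.05
support_ac = expected_support({"a", "c"}, [des_a, des_b])
```

A discriminative variant would compare such supports between the positive and negative classes instead of thresholding a single value.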


Future work: graph classification
Direction Two: Irredundant Iterative Feature Mining
Embedding fading approach:


Future work: time line


Thanks

Questions?

Reference

For a complete list of references, please refer to

http://cxs02.ist.psu.edu:8090/Blog/Proposal.html

Backup Slides

[Figure: example features (f1, f2), database graphs (g1, g2), and query graph (q), with nodes labeled A, B, C, D]