Graph Feature Mining for Indexing and Classification
A Thesis Proposal in Department of Computer Science and Engineering
by Dayu Yuan
The Pennsylvania State University
© Dayu Yuan 04/21/23
Committee Chairs: Dr. Prasenjit Mitra, Dr. C. Lee Giles
CSE Department Faculty Members: Dr. Jessy Barlow, Dr. Daniel Kifer
Outside Member: Dr. Zan Huang
Outline 1. Motivation & Introduction 2. Graph index design for subgraph search
(WebDB 2011)
3. Graph feature mining for subgraph search (In submission)
4. Future work
Motivation: graphs are prevalent:
Chemistry: chemical molecules. Biology: protein structures. Computer-aided design. Image processing & computer vision. Social networks.
The figures above are from the Internet.
Mine and Manage Graph Data
Our focus:
Data: a graph database, i.e., a collection of graphs.
Graph scale: hundreds of nodes & edges (e.g., chemical molecules, small social communities, mechanical parts).
Graph type: labeled, connected, undirected graphs (can be extended to other types).
[Figure: example graph database g1–g5 with node labels a–d]
Graph Feature/Pattern: nothing but subgraphs, subtrees, or random-walk paths.
There is an exponential number of subgraphs in a graph database, so enumerating them all is impossible; frequent subgraphs are a popular choice.
[Figure: graph database g1–g5 and subgraph features P1–P4]
Graph Feature/Pattern
[Figure: graph database g1–g5 and subgraph features P1–P4 with their supporting graphs]
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3
Graph Features in Subgraph Search Subgraph Search: In a graph database D = {g1,g2,...gn}, given a query
graph q, the subgraph search algorithm returns all database graphs containing q as a subgraph.
[Figure: graph database D (g1–g5) and query graph q1]
Graph Features in Subgraph Search
Filtering + Verification (gIndex '04): if a graph g contains the query q, then g must contain all of q's subgraphs.
[Figure: graph database D (g1–g5) and query graph q1]
[Figure: subgraph features P1–P4]
Features:
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3
Graph Features in Supergraph Search
Supergraph Search: In a graph database D = {g1, g2, ..., gn}, given a query graph q, the supergraph search algorithm returns all database graphs that have q as a supergraph.
[Figure: graph database D (g1–g5) and query graph q2]
Graph Features in Supergraph Search
[Figure: graph database D (g1–g5)]
Filter + Verification (cIndex '07): for any subgraph feature not contained in the query, its supporting set can be filtered out.
[Figure: subgraph features P1–P4]
Features:
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3
[Figure: query graph q2]
Graph Features in Graph Classification
Graph classification: protein activity prediction, drug toxicity prediction, image classification.
Graph kernels: hard to interpret the results and the rules.
[Figure: graph database D (g1–g5)]
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3
Graph vectors over (P1, P2, P3, P4): g1 = [1,1,0,1]^T, g2 = [1,0,1,0]^T, g3 = [1,0,1,1]^T, g4 = [1,1,1,0]^T, g5 = [0,0,1,0]^T
Graph Feature Mining Motivation
Graph query operations (subgraph/supergraph search) need features to build the index.
Graph learning needs features to explicitly represent graphs as vectors.
Challenges:
Indexing: limited memory. Classification: curse of dimensionality.
There are exponentially many subgraphs, and too many frequent subgraphs, most of which are redundant and not discriminative/informative.
Research in Graph Feature Mining
1. Mine frequent subgraphs:
(1) Not computationally efficient
(2) Curse of dimensionality
(3) Most frequent subgraphs are redundant or not discriminative.
2. Batch-mode discriminative & frequent subgraph mining:
(A) First enumerate all frequent subgraphs [bottleneck]
(B) Mine discriminative & frequent subgraphs out of the frequent subgraphs
Challenge: how to set the minimum support in step (A)? Small? Big?
Research in Graph Feature Mining
3. Direct feature mining:
A. Search for a feature f optimizing an objective function
B. Find K features: run the above algorithm iteratively (forward feature selection)
Proposal:
Our initial study on subgraph search: graph index structure design; graph feature mining.
Future work plan:
Supergraph search: graph index structure design (initial study); graph feature mining.
Graph classification: graph descriptor mining for classification; irredundant iterative feature mining.
Outline 1. Motivation & Introduction 2. Graph index design for subgraph search
(WebDB 2011) Background of Subgraph Search:
Preliminary & Problem Definition Filter + Verification [Feature Based Index Approach]
3. Graph feature mining for subgraph search (In submission) Direct feature mining for sub search
4. Future work
Problem Definition: In a graph database D = {g1,g2,...gn}, given a query
graph q, the subgraph search algorithm returns all database graphs containing q as a subgraph.
Solution:
Brute force: for each query q, scan the dataset to find D(q).
Filter + verification: given a query q, find a candidate set C(q), then verify each graph in C(q) to obtain D(q).
Subgraph Search: Definition
[Figure: D ⊇ C(q) ⊇ D(q); brute force corresponds to C(q) = D]
Filter + Verification rule: if a graph g contains the query q, then g must contain all of q's subgraphs.
Inverted index, a <key, value> pair. Key: a subgraph feature (a small subgraph fragment). Value: a posting list (the IDs of all database graphs containing the "key" subgraph).
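The filter-and-verification pipeline can be sketched as follows. This is a minimal illustration with hypothetical data: graphs are plain IDs, the posting lists reuse the P1–P4 example from the earlier slides, and a table lookup stands in for the NP-complete subgraph-isomorphism test.

```python
# Inverted index: feature -> posting list (IDs of database graphs containing it)
index = {
    "P1": {"g1", "g2", "g3", "g4"},
    "P2": {"g1", "g4"},
    "P3": {"g2", "g3", "g4", "g5"},
    "P4": {"g1", "g3"},
}

def filter_candidates(query_features, index, database):
    """Intersect the posting lists of all features contained in the query."""
    candidates = set(database)
    for f in query_features:
        candidates &= index[f]
    return candidates

def search(query_features, answer_set, index, database):
    """Filtering, then 'verification' (here a table lookup standing in for
    the expensive subgraph-isomorphism test on each candidate)."""
    candidates = filter_candidates(query_features, index, database)
    return {g for g in candidates if g in answer_set}

database = {"g1", "g2", "g3", "g4", "g5"}
# Suppose the query contains features P1 and P2; true answers are g1 and g4.
cands = filter_candidates({"P1", "P2"}, index, database)   # {'g1', 'g4'}
```

Filtering shrinks the candidate set from five graphs to two, so only two expensive verification tests remain.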
Subgraph Search: Solutions
Total query-processing time:
(1) Filtering cost (D → C(q)): the cost of searching for subgraph features contained in the query, plus the cost of loading and intersecting the posting lists.
(2) Verification cost (C(q) → D(q)): subgraph-isomorphism tests, which are NP-complete and dominate the overall cost.
Related work: reduce the verification cost by mining subgraph features.
Disadvantages: (1) "batch-mode" feature mining; (2) a different index structure is designed for each kind of feature.
Subgraph Search: Related Work
Outline 1. Motivation & Introduction 2. Graph index design for subgraph search
(WebDB 2011) Background of Subgraph Search: Lindex: A general index structure for subsearch
Effective (filtering power) Efficient (response time) Compact (memory consumption) Experiment Results
3. Graph feature mining for subgraph search (In submission) Direct feature mining for sub search
4. Future work
Lindex: A general index structure
Contributions:
Orthogonal to related work (feature mining): applicable to all subgraph/subtree features; Lindex decouples feature mining from index-structure design.
Compact, effective, and efficient.
Lattice-like index:
(1) Organize indexing features in a lattice
(2) Partition the value set (supporting set / posting list)
Definition (maxSub, minSuper). S is all indexing features
Lindex: Effective in Filtering
maxSub(g, S) = {gi ∈ S | gi ⊂ g, ¬∃x ∈ S s.t. gi ⊂ x ⊂ g}
minSup(g, S) = {gi ∈ S | g ⊂ gi, ¬∃x ∈ S s.t. g ⊂ x ⊂ gi}
(1) sg2 and sg4 are maxSub of q; (2) sg5 is minSup of q.
Strategy One: Minimal Supergraph Filtering Given a query q and Lindex L(D,S), the candidate set on
which an algorithm should check for subgraph isomorphism is
Lindex: Effective in Filtering
C(q) = ∩_i D(fi) − ∪_j D(hj), ∀fi ∈ maxSub(q), ∀hj ∈ minSup(q)
Example: C(q) = D(sg2) ∩ D(sg4) − D(sg5) = {a, b, c} ∩ {a, b, d} − {b} = {a, b} − {b} = {a}
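Strategy one can be checked mechanically on the slide's example; the posting lists below are the ones from the example, and `candidate_set` is a hypothetical helper implementing the formula.

```python
# C(q) = intersection of D(f) over maxSub features, minus union over minSup features
def candidate_set(max_sub_postings, min_sup_postings):
    intersected = set.intersection(*max_sub_postings)
    excluded = set().union(*min_sup_postings) if min_sup_postings else set()
    return intersected - excluded

# Slide example: maxSub(q) = {sg2, sg4}, minSup(q) = {sg5}
D_sg2, D_sg4, D_sg5 = {"a", "b", "c"}, {"a", "b", "d"}, {"b"}
C_q = candidate_set([D_sg2, D_sg4], [D_sg5])   # {'a'}
```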
D(sg0) ∩ D(sg1) ∩ D(sg2) ∩ D(sg3) ∩ D(sg4) = D(sg2) ∩ D(sg4)
Strategy Two: Postings Partition
Direct & indirect value sets:
Direct set Vd(sg): the graphs g ∈ D(sg) such that sg can extend to g without being isomorphic to any other indexing feature.
Indirect set: Vi(sg) = D(sg) − Vd(sg)
[Figure: indexed features and database graphs]
Why is "b" in the direct value set of "sg1", while "a" is not?
Given a query q and Lindex L(D,S), the candidate set on which an algorithm should check for subgraph isomorphism is
Lindex: Effective in Filtering
C(q) = ∩_i Vd(fi) − ∪_j D(hj), ∀fi ∈ maxSub(q), ∀hj ∈ minSup(q)
Example: graphs to be verified for the query "a":
Traditional model: {a, b, c} ∩ {a, b, c} = {a, b, c}
Strategy (1): {a, b, c} ∩ {a, b, c} − {c} = {a, b}
Strategy (1 + 2): {a, c} ∩ {a} − {c} = {a}
Lindex: Compact
Space saving (extension labeling): each edge in a graph is represented as
<ID(u), ID(v), Label(u), Label(edge(u, v)), Label(v)>
The label of the graph sg2 is < 1,2,6,1,7 >, < 1,3,6,2,6 >; the label of its chosen parent sg1 is < 1,2,6,1,7 >.
The subgraph sg2 can therefore be stored as just < 1,3,6,2,6 >.
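The saving can be illustrated with a small sketch. The edge tuples are the ones from the slide; the prefix assumption and the helper names are simplifications of the actual extension-labeling scheme.

```python
# Each edge: (ID(u), ID(v), Label(u), Label(edge), Label(v))
sg1 = [(1, 2, 6, 1, 7)]                    # chosen parent
sg2 = [(1, 2, 6, 1, 7), (1, 3, 6, 2, 6)]   # child = parent plus one extension edge

def delta_store(child, parent):
    """Store only the edges the child adds beyond its chosen parent
    (assumes the parent's edge tuples appear as a prefix of the child's)."""
    assert child[:len(parent)] == parent
    return child[len(parent):]

def reconstruct(delta, parent):
    """Recover the full edge list from the parent and the stored delta."""
    return parent + delta

stored = delta_store(sg2, sg1)             # [(1, 3, 6, 2, 6)]
```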
Lindex: Empirical Evaluation of Memory
[Table: index memory consumption (unit: KB) across feature sets (DFG, ∆TCFG, MimR, Tree+∆DFT). Gindex: 1339–1534 KB; FGindex: 1826 KB; SwiftIndex: 860 KB; Lindex: 671–841 KB.]
the label of the graph sg2 is < 1,2,6,1,7 >,< 1,3,6,2,6 >
the label of its chosen parent sg1 is < 1,2,6,1,7 >
Node1 of sg1 mapped to Node1 of sg2
Lindex: Efficient in Maxsub Feature Search
Instead of constructing canonical labels for each subgraph of q and comparing them with the existing labels in the index to check whether an indexing feature matches, Lindex traverses the graph lattice: the mappings constructed to check that a graph sg1 is contained in q are extended incrementally to check whether a supergraph sg2 of sg1 in the lattice is also contained in q.
Lindex: Efficient in Minsup Feature Search
The set of minimal supergraphs of a query q in Lindex is a subset of the intersection of the sets of descendants of each subgraph node of q in the partial lattice.
Lindex: Experiments
Exp on AIDS Dataset: 40,000 Graphs
Outline 1. Motivation & Introduction 2. Graph index design for subgraph search
(WebDB 2011)
3. Graph feature mining for subgraph search (In submission) Direct feature mining for sub search
Problem Definition & Objective Function Branch & Bound Heuristic-based search space exploration:
Partition of the search space Experiment Results
4. Future work
Feature Mining: Motivation
All previous feature-selection algorithms for the subgraph search problem follow a "batch mode": they assume a stable database, frequent-subgraph enumeration is a bottleneck, and parameter settings (minimum support, etc.) are hard to tune.
Our contributions: the first direct feature-mining algorithm for the subgraph search problem; effective in index updating; chooses high-quality features.
Feature Mining: Problem Definition
Iterative index updating: given database D and current index I with features P0:
(1) Remove the least useful feature
(2) Add a new feature
(3) Go to (1) until convergence
[Figure: graph database D (g1–g5), query q1, and current index features P0 = {p1, p2, p3, p4} with supporting sets p1: g1, g2, g3, g4; p2: g2, g3, g4, g5; p3: g1, g4; p4: g2, g3]
Feature Mining: Problem Definition
Previous work: Given a graph database D, find a set of subgraph
(subtree) features, minimizing the response time over training queries Q.
Tresp(q) = Tfilter(q) + Tverf(q, C(q))
Tresp(q) ≈ Tverf(q, C(q)) ∝ |C(q)|, where
|C(q)| = 0 if q ∈ P; |C(q)| = |∩_{pi ∈ maxSub(q)} D(pi)| if q ∉ P
P = argmin_{|P|=N} Σ_{q∈Q} |C(q, P)|
gain(p, P0) = Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, {p} ∪ P0)|
p = argmax_p gain(p, P0)
Our work: Given a graph database D, an already built
index I with feature set P0, search for a new feature p, such that the new feature set {P0 + p} minimizes the response time
Feature Mining: Problem Definition
Iterative index updating: given database D and current index I with features P0:
(1) Remove the least useful feature: find a feature p⁻ ∈ P0
(2) Add a new feature: find a new feature p⁺
(3) Go to (1)
C(q, P) = ∩_{Xq[i]=1} D(pi) = ∩_{pi ∈ maxSub(q, P)} D(pi)
Remove: p⁻ = argmin_{p∈P0} ( Σ_{q∈Q} |C(q, P0 \ p)| − Σ_{q∈Q} |C(q, P0)| ); then P0 = P0 − p⁻
Add: p⁺ = argmax_p ( Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, {p} ∪ P0)| ); then P0 = P0 + p⁺
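The gain of a candidate feature can be computed directly from these definitions. The sketch below uses a hypothetical data model in which containment relations (`contains`) and posting lists are precomputed tables; in the real system they come from subgraph-isomorphism tests.

```python
def candidate(q, feature_postings, contains, database):
    """C(q, P): intersect the posting lists of indexed features contained in q."""
    c = set(database)
    for f, postings in feature_postings.items():
        if f in contains[q]:
            c &= postings
    return c

def gain(p, D_p, P0, queries, contains, database):
    """gain(p, P0) = sum_q |C(q, P0)| - sum_q |C(q, {p} ∪ P0)|."""
    total = 0
    for q in queries:
        before = candidate(q, P0, contains, database)
        if q == p:                 # the query is p itself: no candidates remain
            after = set()
        elif p in contains[q]:     # p becomes an extra filtering feature for q
            after = before & D_p
        else:
            after = before
        total += len(before) - len(after)
    return total

# Hypothetical example: one indexed feature p1, one query q1 that contains
# both p1 and the candidate feature p_new.
database = {"g1", "g2", "g3", "g4", "g5"}
P0 = {"p1": {"g1", "g2", "g3", "g4"}}
contains = {"q1": {"p1", "p_new"}}
g = gain("p_new", {"g1", "g4"}, P0, ["q1"], contains, database)   # 4 - 2 = 2
```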
Feature Mining: More on the Objective Function
(1) Pros and cons of using query logs: the objective functions of previous algorithms (e.g., Gindex, FGindex) also depend on queries, though implicitly.
(2) Selected features are "discriminative". Previous work measures the discriminative power of sg w.r.t. sub(sg) or sup(sg), where sub(sg) denotes all subgraphs of sg and sup(sg) all supergraphs of sg. In our objective function, discriminative power is measured w.r.t. P0.
(3) Computation issues (next slides).
Feature Mining: More on the Objective Function (cont.)
gain(p, P0) = Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, {p} ∪ P0)|
Only two kinds of queries in Q contribute to the gain: queries identical to p, {q ∈ Q | q = p}, and the minimal-supergraph queries minSup(p, Q) = {q ∈ Q | p ∈ maxSub(q, {p} ∪ P0)}, where
maxSub(q, S) = {gi ∈ S | gi ⊂ q, ¬∃x ∈ S s.t. gi ⊂ x ⊂ q}
Hence:
gain(p, P0) = Σ_{q∈minSup(p,Q)} (|C(q, P0)| − |C(q, {p} ∪ P0)|) + Σ_{q∈Q} I(q = p)·|C(q, P0)|
            = Σ_{q∈minSup(p,Q)} (|C(q, P0)| − |C(q, P0) ∩ D(p)|) + Σ_{q∈Q} I(q = p)·|C(q, P0)|
Computing D(p) for each enumerated feature p is expensive.
Feature Mining: Estimate the Objective Function
The objective function for a new subgraph feature p has easy-to-compute upper and lower bounds:
Upp(p, P0) = (1/|Q|) Σ_{q∈minSup(p,Q)} |C(q, P0) − D(q)| + (1/|Q|) Σ_{q∈Q} I(q = p)·|C(q, P0)|
Low(p, P0) = (1/|Q|) Σ_{q∈minSup(p,Q)} |C(q, P0) − (1/γ)·D(maxSub(p))| + (1/|Q|) Σ_{q∈Q} I(q = p)·|C(q, P0)|
Both are inexpensive to compute, and can be used in two ways:
(1) Lazy calculation: gain(p, P0) need not be computed when Upp(p, P0) < gain(p*, P0) or when Low(p, P0) > gain(p*, P0).
(2) Estimation: gain(p, P0) ≈ α·Upp(p, P0) + (1 − α)·Low(p, P0)
(Proof omitted.)
Feature Mining: Challenges
(1) Exponential search space for the new index subgraph feature p.
(2) The objective function is neither monotonic nor anti-monotonic [the Apriori rule cannot be used].
(3) Traditional heuristic-based graph feature mining algorithms (e.g., LeapSearch) do not work, since they rely only on frequencies.
Feature Mining: Branch and Bound
Exhaustive search according to the DFS tree: a graph (pattern) can be canonically labeled as a string, and the DFS tree is a prefix tree of the labels of graphs.
[Figure: DFS prefix tree with nodes n1–n7]
For each branch, e.g., the branch starting from n5, find a branch upper bound that is at least the gain of every node on that branch. (The objective function itself is neither monotonic nor anti-monotonic.)
Feature Mining: Branch and Bound
Theorem: for a feature p, a branch upper bound BUpp(p, P0) exists such that gain(p', P0) ≤ BUpp(p, P0) for all p' that are supergraphs of p:
BUpp(p, P0) = (1/|Q|) { Σ_{q∈Q, q⊃p} |C(q, P0) − D(q)| + Σ_{q∈Q} max_{p'⊃p} |C(p')|·I(q = p') }
(Proof omitted.)
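The pruning logic can be sketched generically. The prefix tree mirrors the slide's n1–n7, but the gain values are hypothetical and `subtree_bound` is an exact stand-in for BUpp; a real implementation would enumerate subgraph patterns and evaluate BUpp(p, P0) instead.

```python
def branch_and_bound(root, children, gain, branch_upper_bound):
    """Depth-first search over the pattern tree; prune any branch whose
    upper bound cannot beat the best gain found so far."""
    best, best_gain = None, float("-inf")
    stack = [root]
    while stack:
        node = stack.pop()
        g = gain(node)
        if g > best_gain:
            best, best_gain = node, g
        for child in children(node):
            if branch_upper_bound(child) > best_gain:   # otherwise prune
                stack.append(child)
    return best, best_gain

# Toy prefix tree mirroring the slide's n1..n7, with hypothetical gain values.
tree = {"n1": ["n2", "n5"], "n2": ["n3", "n4"], "n5": ["n6", "n7"],
        "n3": [], "n4": [], "n6": [], "n7": []}
gains = {"n1": 0, "n2": 2, "n3": 1, "n4": 5, "n5": 1, "n6": 2, "n7": 3}

def subtree_bound(node):
    # An exact upper bound (max gain in the subtree); BUpp plays this role
    # in the real algorithm.
    return max([gains[node]] + [subtree_bound(c) for c in tree[node]])

best, best_gain = branch_and_bound("n1", tree.__getitem__,
                                   gains.__getitem__, subtree_bound)
# best = 'n4', best_gain = 5; the branch below n3 is never expanded
```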
Feature Mining: Heuristic-Based Search-Space Partition
Problem: the search always starts from the same root and proceeds in the same order.
Observation: the new graph pattern p must be a supergraph of some pattern in P0 (i.e., p ⊃ p2 in Figure 4). A root r is promising when:
1) A large proportion of the queries are supergraphs of r; otherwise few queries would use a p ⊃ r for filtering.
2) The average candidate-set size for the queries ⊇ r is large, which means improvement over those queries is important.
sPoint(r) = Σ_{q∈minSup(r,Q)} |C(q, P0) − D(q)| + Σ_{q∈minSup(r,Q)} max_{p'⊃r} |C(p')|·I(q = p')
Feature Mining: Heuristic-Based Search-Space Partition
Procedure:
(1) gain(p*) = 0
(2) Sort all roots r ∈ P0 by sPoint(r) in decreasing order
(3) Iterate:
for i = 1 to |P0| do
  if the branch upper bound BUpp(ri) < gain(p*) then break
  else
    find the minimal-supergraph queries minSup(ri, Q)
    p*(ri) = BranchAndBoundSearch(minSup(ri, Q), p*)
    if gain(p*(ri)) > gain(p*) then update p* = p*(ri)
Discussion:
(1) Candidate features are enumerated as descendants of the root (partitioning the search space).
(2) Candidate features need only be frequent on D(r), not on all of D, allowing a smaller minimum support.
(3) Roots are visited in decreasing sPoint(r) order, so a close-to-optimal feature is found quickly.
(4) Top-k feature selection.
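The outer loop can be sketched as follows; the sPoint scores, branch bounds, and per-branch search results are hypothetical stand-ins for the components described above.

```python
def partitioned_search(roots, s_point, branch_upper_bound, search_branch):
    """Visit roots in decreasing sPoint order; following the slide's
    procedure, stop once a root's branch upper bound cannot beat the
    best gain found so far."""
    best, best_gain = None, 0.0
    for r in sorted(roots, key=s_point, reverse=True):
        if branch_upper_bound(r) < best_gain:
            break
        cand, cand_gain = search_branch(r, best_gain)
        if cand_gain > best_gain:
            best, best_gain = cand, cand_gain
    return best, best_gain

# Hypothetical roots with precomputed scores, bounds, and branch results.
s_point = {"r1": 10, "r2": 5, "r3": 1}
bounds = {"r1": 8, "r2": 6, "r3": 2}
branch_results = {"r1": ("p_a", 4), "r2": ("p_b", 7), "r3": ("p_c", 1)}

best, best_gain = partitioned_search(
    ["r1", "r2", "r3"], s_point.get, bounds.get,
    lambda r, cur_best: branch_results[r])
# r3 is never searched: its bound (2) is below the best gain found (7)
```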
Feature Mining: Experiment
The AIDS dataset D (40K chemical molecules).
Index0: Gindex with minimum support 0.05. IndexDF: Gindex with minimum support 0.02 [1175 new features are added].
Index QG/BB/TK (index updated based on Index0): BB: branch and bound; QG: search space partitioned; TK: top-k features returned in one iteration.
Goal: achieve the same decrease in candidate-set size.
Two datasets: D1 & D2 (80% identical). DF(D1): Gindex on dataset D1; DF(D2): Gindex on dataset D2.
Index QG/BB/TK (index updated based on DF(D1)): BB: branch and bound; QG: search space partitioned; TK: top-k features returned in one iteration.
Exp1: D2 = D1 + 20% new graphs. Exp2: D2 = 80% D1 + 20% new graphs. Iterate until the objective value is stable.
Feature Mining: Experiments
DF vs. iterative methods
TCFG vs. iterative methods
MimR vs. iterative methods
Outline 1. Motivation & Introduction 2. Graph index design for subgraph search
(WebDB 2011)
3. Graph feature mining for subgraph search (In submission)
4. Future work Supergraph Search
Index structure Feature Selection for Supergraph Search
Graph feature mining for classification Time line
Future work: supergraph search
Problem Definition: In a graph database D = {g1, g2, ..., gn}, given a query graph q, the supergraph search algorithm returns all database graphs that have q as a supergraph.
Exclusive-logic filtering: for any subgraph feature not contained in the query, its supporting set can be filtered out.
In-memory model: too many disk operations if the postings are stored on disk.
Future work: supergraph search. On-disk model & feature selection:
Features are organized in a lattice (as in Lindex); each feature is associated with one disjoint value set.
The value set of feature f satisfies:
ValueSet(f) ∩ ValueSet(p) = ∅ for any two features f ≠ p
ValueSet(f) ⊆ D(f) − ∪_{p ⊃ f} D(p)
Query processing model:
C(q) = ∪_{f ∈ F(q)} ValueSet(f) ∪ ValueSet(∅)
|C(q)| = Σ_{f ∈ F(q)} |ValueSet(f)| + |ValueSet(∅)|
[Figure: lattice of features Sg1, Sg2, Sg3 and query q, with value sets Sg1: [a], Sg2: [c], Sg3: [b]]
[Figure: the same lattice with a different value-set assignment: Sg1: [], Sg2: [a, c], Sg3: [b]]
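One simple (not necessarily optimal) way to build disjoint value sets is to assign each database graph to a maximal indexed feature containing it; under the hypothetical lattice and supporting sets below, this reproduces the second assignment shown on the slides. All names and data here are illustrative.

```python
def disjoint_value_sets(D, postings, supergraphs):
    """Partition D into per-feature value sets obeying
    ValueSet(f) ⊆ D(f) − ∪_{p ⊃ f} D(p); uncovered graphs go to 'phi'."""
    value = {f: set() for f in postings}
    value["phi"] = set()
    for g in D:
        # features containing g such that no lattice supergraph also contains g
        eligible = [f for f, d in postings.items()
                    if g in d and not any(g in postings[p] for p in supergraphs[f])]
        if eligible:
            value[sorted(eligible)[0]].add(g)   # arbitrary deterministic choice
        else:
            value["phi"].add(g)
    return value

# Hypothetical lattice: Sg2 and Sg3 are supergraphs of Sg1.
postings = {"Sg1": {"a", "b", "c"}, "Sg2": {"a", "c"}, "Sg3": {"b"}}
supergraphs = {"Sg1": ["Sg2", "Sg3"], "Sg2": [], "Sg3": []}
value = disjoint_value_sets({"a", "b", "c"}, postings, supergraphs)
# value: Sg1 -> set(), Sg2 -> {'a', 'c'}, Sg3 -> {'b'}, phi -> set()
```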
Future work: supergraph search. On-disk model & feature selection:
Tentative solution to this problem:
(1) Cast it as the banking float maximization problem (similar to the max-cover problem): NP-hard, but a polynomial algorithm exists with approximation ratio (1 − 1/e).
(2) Scalability issues:
(a) Solve the problem with MapReduce (MapReduce solutions exist for the max-cover problem [63, 65])
(b) Direct feature mining
Future work: graph classification
Feature vector for graph data: given a set of n graph features p1, p2, ..., pn, a graph g can be encoded as a vector Xg = [x1, x2, ..., xn]^T, where xi is a {0, 1}-valued variable and xi = 1 if and only if pi ⊆ g.
g1 =[1,1,1,0]T
g2 =[1,0,1,1]T
g3 =[0,1,1,0]T
q =[1,0,1,0]T
Compared with graph kernels: interpretability.
Graph vectors based on features p1, p2, p3, and p4.
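Encoding graphs as binary feature vectors is mechanical once posting lists are available. The lists below reuse the P1–P4 supporting sets from the earlier subgraph-search slides (this slide's p1–p4 example uses a different feature set), so the resulting vectors differ from those above.

```python
postings = {
    "P1": {"g1", "g2", "g3", "g4"},
    "P2": {"g1", "g4"},
    "P3": {"g2", "g3", "g4", "g5"},
    "P4": {"g1", "g3"},
}
features = ["P1", "P2", "P3", "P4"]

def to_vector(g, features, postings):
    # x_i = 1 iff graph g supports feature p_i
    return [1 if g in postings[p] else 0 for p in features]

vectors = {g: to_vector(g, features, postings)
           for g in ["g1", "g2", "g3", "g4", "g5"]}
# g1 -> [1, 1, 0, 1], g5 -> [0, 0, 1, 0]
```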
Future work: graph classification. Previous work:
(1) Class-two algorithms [batch mode]
(2) Class-three algorithms [direct mining]:
Iterative mining: at each iteration, only one feature is selected.
Redundancy: a feature f with a very low objective value in iteration i may be enumerated again in later iterations.
(3) Heuristic-based feature mining: the search space of subgraph features is explored randomly; close-to-optimal features tend to be discovered quickly (much faster than exact search).
Why not use other descriptors instead of subgraphs as features?
Future work: graph classification Direction One: Descriptor Mining
Tentative solution: use information diffusion models to build a descriptor capturing local context and topology information.
Future work: graph classification. Direction One: Descriptor Mining
What is information diffusion? (the word-of-mouth effect) Different models share a general assumption: nodes are either active or inactive; active nodes may cause others to activate; active nodes never deactivate.
Linear Threshold Model [69]; Independent Cascade Model [70].
Build a descriptor:
(1) For each node v, activate only v and trigger the information-diffusion procedure.
(2) Collect all active nodes to build the descriptor.
Information Diffusion Models
Linear Threshold Model [69]: a node v is influenced by each neighbor w with weight b(v, w). Node v is activated once the total influence of its active neighbors reaches its threshold, Σ_w b(v, w) ≥ δv, where δv is a random variable in [0, 1].
Independent Cascade Model [70]: when node v is activated, it has one and only one chance to activate each inactive neighbor w; the activation attempt succeeds with probability pvw.
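A minimal simulation of the Independent Cascade model; the graph, edge probabilities, and function names are hypothetical, with deterministic probabilities so the outcome is fixed.

```python
import random

def independent_cascade(neighbors, prob, seed_node, rng):
    """Run one cascade from a single seed: each newly activated node v gets
    exactly one chance to activate each inactive neighbor w, succeeding
    with probability prob[(v, w)]."""
    active = {seed_node}
    frontier = [seed_node]
    while frontier:
        next_frontier = []
        for v in frontier:
            for w in neighbors[v]:
                if w not in active and rng.random() < prob[(v, w)]:
                    active.add(w)
                    next_frontier.append(w)
        frontier = next_frontier
    return active

# Tiny example: a path a - b - c with deterministic edge probabilities.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
prob = {("a", "b"): 1.0, ("b", "a"): 1.0, ("b", "c"): 0.0, ("c", "b"): 1.0}
active = independent_cascade(neighbors, prob, "a", random.Random(0))  # {'a', 'b'}
```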
Future work: graph classification. Build a descriptor:
(1) For each node v, activate only v and trigger the information diffusion.
(2) Collect all active nodes to build the descriptor.
(3) p1 denotes the probability that a node with label 'a' becomes active.
Des(v, l): the descriptor of node v with label l.
DesA probabilities: label a: p1 = 1; label b: p2 = .8; label c: p3 = .9; label d: p4 = .4; ...
DesB probabilities: label a: p1 = 0.5; label b: p2 = .4; label c: p3 = .3; label d: p4 = .7; ...
Future work: graph classification. Mining descriptors for graph classification: convert to the probabilistic itemset mining problem.
Future plan:
(1) Try different information-diffusion models to see their pros and cons in modeling the 'local' context and topology information.
(2) Explore discriminative probabilistic itemset-mining algorithms for (binary) classification.
Future work: graph classification. Direction Two: Irredundant Iterative Feature Mining (embedding-fading approach).
Future work: time line
Thanks
Questions?
Reference
For a complete list of references, please refer to
http://cxs02.ist.psu.edu:8090/Blog/Proposal.html
Backup Slides
[Backup figure: features f1, f2, database graphs g1, g2, and a query graph q, with node IDs and labels A–D]