Graph Feature Mining for Indexing and Classification
A Thesis Proposal in Department of Computer Science and Engineering
by Dayu Yuan
The Pennsylvania State University
© Dayu Yuan 04/21/23
Committee Chairs: Dr. Prasenjit Mitra, Dr. C. Lee Giles
CSE Department Faculty Members: Dr. Jessy Barlow, Dr. Daniel Kifer
Outside Member: Dr. Zan Huang
Outline 1. Motivation & Introduction 2. Graph index design for subgraph search
(WebDB 2011)
3. Graph feature mining for subgraph search (In submission)
4. Future work
Motivation: graphs are prevalent:
Chemistry: chemical molecules. Biology: protein structures. Computer-aided design. Image processing & computer vision. Social networks.
The figures above are from the Internet.
Mine and Manage Graph Data
Our focus:
Data: a graph database, i.e., a collection of graphs.
Graph scale: hundreds of nodes & edges (e.g., chemical molecules, small social communities, mechanical parts).
Graph type: labeled, connected, undirected graphs (can be extended to other types).
[Figure: example graph database g1–g5 with node labels a–d]
Graph Feature/Pattern: nothing but subgraphs, subtrees, or random-walk paths.
There is an exponential number of subgraphs in a graph database, so enumerating them all is impossible; frequent subgraphs are a popular choice.
[Figure: graph database g1–g5 and subgraph features P1–P4]
Graph Feature/Pattern
[Figure: graph database g1–g5 and subgraph features P1–P4 with their supporting graphs]
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3
Graph Features in Subgraph Search Subgraph Search: In a graph database D = {g1,g2,...gn}, given a query
graph q, the subgraph search algorithm returns all database graphs containing q as a subgraph.
[Figure: graph database D (g1–g5) and query graph q1]
Graph Features in Subgraph Search
Filtering + Verification (gIndex '04): if a graph g contains the query q, then g must contain all of q's subgraphs.
[Figure: graph database D (g1–g5) and query graph q1]
[Figure: subgraph features P1–P4]
Features:
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3
Graph Features in Supergraph Search
Supergraph Search: In a graph database D = {g1, g2, ..., gn}, given a query graph q, the supergraph search algorithm returns all database graphs that have q as a supergraph.
[Figure: graph database D (g1–g5) and query graph q2]
Graph Features in Supergraph Search
[Figure: graph database D (g1–g5)]
Filter + Verification (cIndex '07): for any subgraph feature not contained in the query, its supporting set can be filtered out.
[Figure: subgraph features P1–P4]
Features:
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3
[Figure: query graph q2]
Graph Features in Graph Classification
Graph classification: protein activity prediction, drug toxicity prediction, image classification.
Graph kernels: hard to interpret the results and the rules.
[Figure: graph database D (g1–g5)]
P1: g1, g2, g3, g4
P2: g1, g4
P3: g2, g3, g4, g5
P4: g1, g3
Graph vectors over (P1, P2, P3, P4): g1 = [1,1,0,1]^T, g2 = [1,0,1,0]^T, g3 = [1,0,1,1]^T, g4 = [1,1,1,0]^T, g5 = [0,0,1,0]^T
Graph Feature Mining Motivation
Graph query operations (subgraph/supergraph search) need features to build the index.
Graph learning needs features to explicitly represent graphs as vectors.
Challenges:
Indexing: limited memory. Classification: curse of dimensionality.
There are exponentially many subgraphs, and too many frequent subgraphs, most of which are redundant and not discriminative/informative.
Research in Graph Feature Mining
1. Mine frequent subgraphs:
(1) Not computationally efficient
(2) Curse of dimensionality
(3) Most frequent subgraphs are redundant or not discriminative.
2. Batch-mode discriminative & frequent subgraph mining:
(A) First enumerate all frequent subgraphs [bottleneck]
(B) Mine discriminative & frequent subgraphs out of the frequent subgraphs
Challenge: how to set the minimum support in step (A)? Small? Big?
Research in Graph Feature Mining
3. Direct feature mining:
A. Search for a feature f optimizing an objective function
B. Find K features: run the above algorithm iteratively (forward feature selection)
Proposal:
Our initial study on subgraph search: graph index structure design; graph feature mining.
Future work plan:
Supergraph search: graph index structure design (initial study); graph feature mining.
Graph classification: graph descriptor mining for classification; irredundant iterative feature mining.
Outline 1. Motivation & Introduction 2. Graph index design for subgraph search
(WebDB 2011) Background of Subgraph Search:
Preliminary & Problem Definition Filter + Verification [Feature Based Index Approach]
3. Graph feature mining for subgraph search (In submission) Direct feature mining for sub search
4. Future work
Problem Definition: In a graph database D = {g1,g2,...gn}, given a query
graph q, the subgraph search algorithm returns all database graphs containing q as a subgraph.
Solution:
Brute force: for each query q, scan the dataset to find D(q).
Filter + verification: given a query q, find a candidate set C(q), then verify each graph in C(q) to obtain D(q).
Subgraph Search: Definition
[Figure: D ⊇ C(q) ⊇ D(q); brute force corresponds to C(q) = D]
Filter + Verification rule: if a graph g contains the query q, then g must contain all of q's subgraphs.
Inverted index, a <key, value> pair. Key: a subgraph feature (a small subgraph fragment). Value: a posting list (the IDs of all database graphs containing the "key" subgraph).
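The filter-and-verification pipeline can be sketched as follows. This is a minimal illustration with hypothetical data: graphs are plain IDs, the posting lists reuse the P1–P4 example from the earlier slides, and a table lookup stands in for the NP-complete subgraph-isomorphism test.

```python
# Inverted index: feature -> posting list (IDs of database graphs containing it)
index = {
    "P1": {"g1", "g2", "g3", "g4"},
    "P2": {"g1", "g4"},
    "P3": {"g2", "g3", "g4", "g5"},
    "P4": {"g1", "g3"},
}

def filter_candidates(query_features, index, database):
    """Intersect the posting lists of all features contained in the query."""
    candidates = set(database)
    for f in query_features:
        candidates &= index[f]
    return candidates

def search(query_features, answer_set, index, database):
    """Filtering, then 'verification' (here a table lookup standing in for
    the expensive subgraph-isomorphism test on each candidate)."""
    candidates = filter_candidates(query_features, index, database)
    return {g for g in candidates if g in answer_set}

database = {"g1", "g2", "g3", "g4", "g5"}
# Suppose the query contains features P1 and P2; true answers are g1 and g4.
cands = filter_candidates({"P1", "P2"}, index, database)   # {'g1', 'g4'}
```

Filtering shrinks the candidate set from five graphs to two, so only two expensive verification tests remain.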
Subgraph Search: Solutions
Total query-processing time:
(1) Filtering cost (D → C(q)): the cost of searching for subgraph features contained in the query, plus the cost of loading and intersecting the posting lists.
(2) Verification cost (C(q) → D(q)): subgraph-isomorphism tests, which are NP-complete and dominate the overall cost.
Related work: reduce the verification cost by mining subgraph features.
Disadvantages: (1) "batch-mode" feature mining; (2) a different index structure is designed for each kind of feature.
Subgraph Search: Related Work
Outline 1. Motivation & Introduction 2. Graph index design for subgraph search
(WebDB 2011) Background of Subgraph Search: Lindex: A general index structure for subsearch
Effective (filtering power) Efficient (response time) Compact (memory consumption) Experiment Results
3. Graph feature mining for subgraph search (In submission) Direct feature mining for sub search
4. Future work
Lindex: A general index structure
Contributions:
Orthogonal to related work (feature mining): applicable to all subgraph/subtree features; Lindex decouples feature mining from index-structure design.
Compact, effective, and efficient.
Lattice-like index:
(1) Organize indexing features in a lattice
(2) Partition the value set (supporting set / posting list)
Definition (maxSub, minSuper). S is all indexing features
Lindex: Effective in Filtering
maxSub(g, S) = {gi ∈ S | gi ⊂ g, ¬∃x ∈ S s.t. gi ⊂ x ⊂ g}
minSup(g, S) = {gi ∈ S | g ⊂ gi, ¬∃x ∈ S s.t. g ⊂ x ⊂ gi}
(1) sg2 and sg4 are maxSub of q; (2) sg5 is minSup of q.
Strategy One: Minimal Supergraph Filtering Given a query q and Lindex L(D,S), the candidate set on
which an algorithm should check for subgraph isomorphism is
Lindex: Effective in Filtering
C(q) = ∩_i D(fi) − ∪_j D(hj), ∀fi ∈ maxSub(q), ∀hj ∈ minSup(q)
Example: C(q) = D(sg2) ∩ D(sg4) − D(sg5) = {a, b, c} ∩ {a, b, d} − {b} = {a, b} − {b} = {a}
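Strategy one can be checked mechanically on the slide's example; the posting lists below are the ones from the example, and `candidate_set` is a hypothetical helper implementing the formula.

```python
# C(q) = intersection of D(f) over maxSub features, minus union over minSup features
def candidate_set(max_sub_postings, min_sup_postings):
    intersected = set.intersection(*max_sub_postings)
    excluded = set().union(*min_sup_postings) if min_sup_postings else set()
    return intersected - excluded

# Slide example: maxSub(q) = {sg2, sg4}, minSup(q) = {sg5}
D_sg2, D_sg4, D_sg5 = {"a", "b", "c"}, {"a", "b", "d"}, {"b"}
C_q = candidate_set([D_sg2, D_sg4], [D_sg5])   # {'a'}
```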
D(sg0) ∩ D(sg1) ∩ D(sg2) ∩ D(sg3) ∩ D(sg4) = D(sg2) ∩ D(sg4)
Strategy Two: Postings Partition
Direct & indirect value sets:
Direct set Vd(sg): the graphs g ∈ D(sg) such that sg can extend to g without being isomorphic to any other indexing feature.
Indirect set: Vi(sg) = D(sg) − Vd(sg)
[Figure: indexed features and database graphs]
Why is "b" in the direct value set of "sg1", while "a" is not?
Given a query q and Lindex L(D,S), the candidate set on which an algorithm should check for subgraph isomorphism is
Lindex: Effective in Filtering
C(q) = ∩_i Vd(fi) − ∪_j D(hj), ∀fi ∈ maxSub(q), ∀hj ∈ minSup(q)
Example: graphs to be verified for the query "a":
Traditional model: {a, b, c} ∩ {a, b, c} = {a, b, c}
Strategy (1): {a, b, c} ∩ {a, b, c} − {c} = {a, b}
Strategy (1 + 2): {a, c} ∩ {a} − {c} = {a}
Lindex: Compact
Space saving (extension labeling): each edge in a graph is represented as
<ID(u), ID(v), Label(u), Label(edge(u, v)), Label(v)>
The label of the graph sg2 is < 1,2,6,1,7 >, < 1,3,6,2,6 >; the label of its chosen parent sg1 is < 1,2,6,1,7 >.
The subgraph sg2 can therefore be stored as just < 1,3,6,2,6 >.
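The saving can be illustrated with a small sketch. The edge tuples are the ones from the slide; the prefix assumption and the helper names are simplifications of the actual extension-labeling scheme.

```python
# Each edge: (ID(u), ID(v), Label(u), Label(edge), Label(v))
sg1 = [(1, 2, 6, 1, 7)]                    # chosen parent
sg2 = [(1, 2, 6, 1, 7), (1, 3, 6, 2, 6)]   # child = parent plus one extension edge

def delta_store(child, parent):
    """Store only the edges the child adds beyond its chosen parent
    (assumes the parent's edge tuples appear as a prefix of the child's)."""
    assert child[:len(parent)] == parent
    return child[len(parent):]

def reconstruct(delta, parent):
    """Recover the full edge list from the parent and the stored delta."""
    return parent + delta

stored = delta_store(sg2, sg1)             # [(1, 3, 6, 2, 6)]
```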
Lindex: Empirical Evaluation of Memory
[Table: index memory consumption (unit: KB) across feature sets (DFG, ∆TCFG, MimR, Tree+∆DFT). Gindex: 1339–1534 KB; FGindex: 1826 KB; SwiftIndex: 860 KB; Lindex: 671–841 KB.]
the label of the graph sg2 is < 1,2,6,1,7 >,< 1,3,6,2,6 >
the label of its chosen parent sg1 is < 1,2,6,1,7 >
Node1 of sg1 mapped to Node1 of sg2
Lindex: Efficient in Maxsub Feature Search
Instead of constructing canonical labels for each subgraph of q and comparing them with the existing labels in the index to check whether an indexing feature matches, Lindex traverses the graph lattice: the mappings constructed to check that a graph sg1 is contained in q are extended incrementally to check whether a supergraph sg2 of sg1 in the lattice is also contained in q.
Lindex: Efficient in Minsup Feature Search
The set of minimal supergraphs of a query q in Lindex is a subset of the intersection of the sets of descendants of each subgraph node of q in the partial lattice.
Lindex: Experiments
Exp on AIDS Dataset: 40,000 Graphs
Outline 1. Motivation & Introduction 2. Graph index design for subgraph search
(WebDB 2011)
3. Graph feature mining for subgraph search (In submission) Direct feature mining for sub search
Problem Definition & Objective Function Branch & Bound Heuristic-based search space exploration:
Partition of the search space Experiment Results
4. Future work
Feature Mining: Motivation
All previous feature-selection algorithms for the subgraph search problem follow a "batch mode": they assume a stable database, frequent-subgraph enumeration is a bottleneck, and parameter settings (minimum support, etc.) are hard to tune.
Our contributions: the first direct feature-mining algorithm for the subgraph search problem; effective in index updating; chooses high-quality features.
Feature Mining: Problem Definition
Iterative index updating: given database D and current index I with features P0:
(1) Remove the least useful feature
(2) Add a new feature
(3) Go to (1) until convergence
[Figure: graph database D (g1–g5), query q1, and current index features P0 = {p1, p2, p3, p4} with supporting sets p1: g1, g2, g3, g4; p2: g2, g3, g4, g5; p3: g1, g4; p4: g2, g3]
Feature Mining: Problem Definition
Previous work: Given a graph database D, find a set of subgraph
(subtree) features, minimizing the response time over training queries Q.
Tresp(q) = Tfilter(q) + Tverf(q, C(q))
Tresp(q) ≈ Tverf(q, C(q)) ∝ |C(q)|, where
|C(q)| = 0 if q ∈ P; |C(q)| = |∩_{pi ∈ maxSub(q)} D(pi)| if q ∉ P
P = argmin_{|P|=N} Σ_{q∈Q} |C(q, P)|
gain(p, P0) = Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, {p} ∪ P0)|
p = argmax_p gain(p, P0)
Our work: Given a graph database D, an already built
index I with feature set P0, search for a new feature p, such that the new feature set {P0 + p} minimizes the response time
Feature Mining: Problem Definition
Iterative index updating: given database D and current index I with features P0:
(1) Remove the least useful feature: find a feature p⁻ ∈ P0
(2) Add a new feature: find a new feature p⁺
(3) Go to (1)
C(q, P) = ∩_{Xq[i]=1} D(pi) = ∩_{pi ∈ maxSub(q, P)} D(pi)
Remove: p⁻ = argmin_{p∈P0} ( Σ_{q∈Q} |C(q, P0 \ p)| − Σ_{q∈Q} |C(q, P0)| ); then P0 = P0 − p⁻
Add: p⁺ = argmax_p ( Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, {p} ∪ P0)| ); then P0 = P0 + p⁺
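The gain of a candidate feature can be computed directly from these definitions. The sketch below uses a hypothetical data model in which containment relations (`contains`) and posting lists are precomputed tables; in the real system they come from subgraph-isomorphism tests.

```python
def candidate(q, feature_postings, contains, database):
    """C(q, P): intersect the posting lists of indexed features contained in q."""
    c = set(database)
    for f, postings in feature_postings.items():
        if f in contains[q]:
            c &= postings
    return c

def gain(p, D_p, P0, queries, contains, database):
    """gain(p, P0) = sum_q |C(q, P0)| - sum_q |C(q, {p} ∪ P0)|."""
    total = 0
    for q in queries:
        before = candidate(q, P0, contains, database)
        if q == p:                 # the query is p itself: no candidates remain
            after = set()
        elif p in contains[q]:     # p becomes an extra filtering feature for q
            after = before & D_p
        else:
            after = before
        total += len(before) - len(after)
    return total

# Hypothetical example: one indexed feature p1, one query q1 that contains
# both p1 and the candidate feature p_new.
database = {"g1", "g2", "g3", "g4", "g5"}
P0 = {"p1": {"g1", "g2", "g3", "g4"}}
contains = {"q1": {"p1", "p_new"}}
g = gain("p_new", {"g1", "g4"}, P0, ["q1"], contains, database)   # 4 - 2 = 2
```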
Feature Mining: More on the Objective Function
(1) Pros and cons of using query logs: the objective functions of previous algorithms (e.g., Gindex, FGindex) also depend on queries, though implicitly.
(2) Selected features are "discriminative". Previous work measures the discriminative power of sg w.r.t. sub(sg) or sup(sg), where sub(sg) denotes all subgraphs of sg and sup(sg) all supergraphs of sg. In our objective function, discriminative power is measured w.r.t. P0.
(3) Computation issues (next slides).
Feature Mining: More on the Objective Function (cont.)
gain(p, P0) = Σ_{q∈Q} |C(q, P0)| − Σ_{q∈Q} |C(q, {p} ∪ P0)|
Only two kinds of queries in Q contribute to the gain: queries identical to p, {q ∈ Q | q = p}, and the minimal-supergraph queries minSup(p, Q) = {q ∈ Q | p ∈ maxSub(q, {p} ∪ P0)}, where
maxSub(q, S) = {gi ∈ S | gi ⊂ q, ¬∃x ∈ S s.t. gi ⊂ x ⊂ q}
Hence:
gain(p, P0) = Σ_{q∈minSup(p,Q)} (|C(q, P0)| − |C(q, {p} ∪ P0)|) + Σ_{q∈Q} I(q = p)·|C(q, P0)|
            = Σ_{q∈minSup(p,Q)} (|C(q, P0)| − |C(q, P0) ∩ D(p)|) + Σ_{q∈Q} I(q = p)·|C(q, P0)|
Computing D(p) for each enumerated feature p is expensive.
Feature Mining: Estimate the Objective Function
The objective function for a new subgraph feature p has easy-to-compute upper and lower bounds:
Upp(p, P0) = (1/|Q|) Σ_{q∈minSup(p,Q)} |C(q, P0) − D(q)| + (1/|Q|) Σ_{q∈Q} I(q = p)·|C(q, P0)|
Low(p, P0) = (1/|Q|) Σ_{q∈minSup(p,Q)} |C(q, P0) − (1/γ)·D(maxSub(p))| + (1/|Q|) Σ_{q∈Q} I(q = p)·|C(q, P0)|
Both are inexpensive to compute, and can be used in two ways:
(1) Lazy calculation: gain(p, P0) need not be computed when Upp(p, P0) < gain(p*, P0) or when Low(p, P0) > gain(p*, P0).
(2) Estimation: gain(p, P0) ≈ α·Upp(p, P0) + (1 − α)·Low(p, P0)
(Proof omitted.)
Feature Mining: Challenges
(1) Exponential search space for the new index subgraph feature p.
(2) The objective function is neither monotonic nor anti-monotonic [the Apriori rule cannot be used].
(3) Traditional heuristic-based graph feature mining algorithms (e.g., LeapSearch) do not work, since they rely only on frequencies.
Feature Mining: Branch and Bound
Exhaustive search according to the DFS tree: a graph (pattern) can be canonically labeled as a string, and the DFS tree is a prefix tree of the labels of graphs.
[Figure: DFS prefix tree with nodes n1–n7]
For each branch, e.g., the branch starting from n5, find a branch upper bound that is at least the gain of every node on that branch. (The objective function itself is neither monotonic nor anti-monotonic.)
Feature Mining: Branch and Bound
Theorem: for a feature p, a branch upper bound BUpp(p, P0) exists such that gain(p', P0) ≤ BUpp(p, P0) for all p' that are supergraphs of p:
BUpp(p, P0) = (1/|Q|) { Σ_{q∈Q, q⊃p} |C(q, P0) − D(q)| + Σ_{q∈Q} max_{p'⊃p} |C(p')|·I(q = p') }
(Proof omitted.)
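The pruning logic can be sketched generically. The prefix tree mirrors the slide's n1–n7, but the gain values are hypothetical and `subtree_bound` is an exact stand-in for BUpp; a real implementation would enumerate subgraph patterns and evaluate BUpp(p, P0) instead.

```python
def branch_and_bound(root, children, gain, branch_upper_bound):
    """Depth-first search over the pattern tree; prune any branch whose
    upper bound cannot beat the best gain found so far."""
    best, best_gain = None, float("-inf")
    stack = [root]
    while stack:
        node = stack.pop()
        g = gain(node)
        if g > best_gain:
            best, best_gain = node, g
        for child in children(node):
            if branch_upper_bound(child) > best_gain:   # otherwise prune
                stack.append(child)
    return best, best_gain

# Toy prefix tree mirroring the slide's n1..n7, with hypothetical gain values.
tree = {"n1": ["n2", "n5"], "n2": ["n3", "n4"], "n5": ["n6", "n7"],
        "n3": [], "n4": [], "n6": [], "n7": []}
gains = {"n1": 0, "n2": 2, "n3": 1, "n4": 5, "n5": 1, "n6": 2, "n7": 3}

def subtree_bound(node):
    # An exact upper bound (max gain in the subtree); BUpp plays this role
    # in the real algorithm.
    return max([gains[node]] + [subtree_bound(c) for c in tree[node]])

best, best_gain = branch_and_bound("n1", tree.__getitem__,
                                   gains.__getitem__, subtree_bound)
# best = 'n4', best_gain = 5; the branch below n3 is never expanded
```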
Feature Mining: Heuristic-Based Search-Space Partition
Problem: the search always starts from the same root and proceeds in the same order.
Observation: the new graph pattern p must be a supergraph of some pattern in P0 (i.e., p ⊃ p2 in Figure 4). A root r is promising when:
1) A large proportion of the queries are supergraphs of r; otherwise few queries would use a p ⊃ r for filtering.
2) The average candidate-set size for the queries ⊇ r is large, which means improvement over those queries is important.
sPoint(r) = Σ_{q∈minSup(r,Q)} |C(q, P0) − D(q)| + Σ_{q∈minSup(r,Q)} max_{p'⊃r} |C(p')|·I(q = p')
Feature Mining: Heuristic-Based Search-Space Partition
Procedure:
(1) gain(p*) = 0
(2) Sort all roots r ∈ P0 by sPoint(r) in decreasing order
(3) Iterate:
for i = 1 to |P0| do
  if the branch upper bound BUpp(ri) < gain(p*) then break
  else
    find the minimal-supergraph queries minSup(ri, Q)
    p*(ri) = BranchAndBoundSearch(minSup(ri, Q), p*)
    if gain(p*(ri)) > gain(p*) then update p* = p*(ri)
Discussion:
(1) Candidate features are enumerated as descendants of the root (partitioning the search space).
(2) Candidate features need only be frequent on D(r), not on all of D, allowing a smaller minimum support.
(3) Roots are visited in decreasing sPoint(r) order, so a close-to-optimal feature is found quickly.
(4) Top-k feature selection.
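The outer loop can be sketched as follows; the sPoint scores, branch bounds, and per-branch search results are hypothetical stand-ins for the components described above.

```python
def partitioned_search(roots, s_point, branch_upper_bound, search_branch):
    """Visit roots in decreasing sPoint order; following the slide's
    procedure, stop once a root's branch upper bound cannot beat the
    best gain found so far."""
    best, best_gain = None, 0.0
    for r in sorted(roots, key=s_point, reverse=True):
        if branch_upper_bound(r) < best_gain:
            break
        cand, cand_gain = search_branch(r, best_gain)
        if cand_gain > best_gain:
            best, best_gain = cand, cand_gain
    return best, best_gain

# Hypothetical roots with precomputed scores, bounds, and branch results.
s_point = {"r1": 10, "r2": 5, "r3": 1}
bounds = {"r1": 8, "r2": 6, "r3": 2}
branch_results = {"r1": ("p_a", 4), "r2": ("p_b", 7), "r3": ("p_c", 1)}

best, best_gain = partitioned_search(
    ["r1", "r2", "r3"], s_point.get, bounds.get,
    lambda r, cur_best: branch_results[r])
# r3 is never searched: its bound (2) is below the best gain found (7)
```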
Feature Mining: Experiment
The AIDS dataset D (40K chemical molecules).
Index0: Gindex with minimum support 0.05. IndexDF: Gindex with minimum support 0.02 [1175 new features are added].
Index QG/BB/TK (index updated based on Index0): BB: branch and bound; QG: search space partitioned; TK: top-k features returned in one iteration.
Goal: achieve the same decrease in candidate-set size.
Two datasets: D1 & D2 (80% identical). DF(D1): Gindex on dataset D1; DF(D2): Gindex on dataset D2.
Index QG/BB/TK (index updated based on DF(D1)): BB: branch and bound; QG: search space partitioned; TK: top-k features returned in one iteration.
Exp1: D2 = D1 + 20% new graphs. Exp2: D2 = 80% D1 + 20% new graphs. Iterate until the objective value is stable.
Feature Mining: Experiments
DF vs. iterative methods
TCFG vs. iterative methods
MimR vs. iterative methods
Outline 1. Motivation & Introduction 2. Graph index design for subgraph search
(WebDB 2011)
3. Graph feature mining for subgraph search (In submission)
4. Future work Supergraph Search
Index structure Feature Selection for Supergraph Search
Graph feature mining for classification Time line
Future work: supergraph search
Problem Definition: In a graph database D = {g1, g2, ..., gn}, given a query graph q, the supergraph search algorithm returns all database graphs that have q as a supergraph.
Exclusive-logic filtering: for any subgraph feature not contained in the query, its supporting set can be filtered out.
In-memory model: too many disk operations if the postings are stored on disk.
Future work: supergraph search. On-disk model & feature selection:
Features are organized in a lattice (as in Lindex); each feature is associated with one disjoint value set.
The value set of feature f satisfies:
ValueSet(f) ∩ ValueSet(p) = ∅ for any two features f ≠ p
ValueSet(f) ⊆ D(f) − ∪_{p ⊃ f} D(p)
Query processing model:
C(q) = ∪_{f ∈ F(q)} ValueSet(f) ∪ ValueSet(∅)
|C(q)| = Σ_{f ∈ F(q)} |ValueSet(f)| + |ValueSet(∅)|
[Figure: lattice of features Sg1, Sg2, Sg3 and query q, with value sets Sg1: [a], Sg2: [c], Sg3: [b]]
[Figure: the same lattice with a different value-set assignment: Sg1: [], Sg2: [a, c], Sg3: [b]]
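One simple (not necessarily optimal) way to build disjoint value sets is to assign each database graph to a maximal indexed feature containing it; under the hypothetical lattice and supporting sets below, this reproduces the second assignment shown on the slides. All names and data here are illustrative.

```python
def disjoint_value_sets(D, postings, supergraphs):
    """Partition D into per-feature value sets obeying
    ValueSet(f) ⊆ D(f) − ∪_{p ⊃ f} D(p); uncovered graphs go to 'phi'."""
    value = {f: set() for f in postings}
    value["phi"] = set()
    for g in D:
        # features containing g such that no lattice supergraph also contains g
        eligible = [f for f, d in postings.items()
                    if g in d and not any(g in postings[p] for p in supergraphs[f])]
        if eligible:
            value[sorted(eligible)[0]].add(g)   # arbitrary deterministic choice
        else:
            value["phi"].add(g)
    return value

# Hypothetical lattice: Sg2 and Sg3 are supergraphs of Sg1.
postings = {"Sg1": {"a", "b", "c"}, "Sg2": {"a", "c"}, "Sg3": {"b"}}
supergraphs = {"Sg1": ["Sg2", "Sg3"], "Sg2": [], "Sg3": []}
value = disjoint_value_sets({"a", "b", "c"}, postings, supergraphs)
# value: Sg1 -> set(), Sg2 -> {'a', 'c'}, Sg3 -> {'b'}, phi -> set()
```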
Future work: supergraph search. On-disk model & feature selection:
Tentative solution to this problem:
(1) Cast it as the banking float maximization problem (similar to the max-cover problem): NP-hard, but a polynomial algorithm exists with approximation ratio (1 − 1/e).
(2) Scalability issues:
(a) Solve the problem with MapReduce (MapReduce solutions exist for the max-cover problem [63, 65])
(b) Direct feature mining
Future work: graph classification
Feature vector for graph data: given a set of n graph features p1, p2, ..., pn, a graph g can be encoded as a vector Xg = [x1, x2, ..., xn]^T, where xi is a {0, 1}-valued variable and xi = 1 if and only if pi ⊆ g.
g1 =[1,1,1,0]T
g2 =[1,0,1,1]T
g3 =[0,1,1,0]T
q =[1,0,1,0]T
Compared with graph kernels: interpretability.
Graph vectors based on features p1, p2, p3, and p4.
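Encoding graphs as binary feature vectors is mechanical once posting lists are available. The lists below reuse the P1–P4 supporting sets from the earlier subgraph-search slides (this slide's p1–p4 example uses a different feature set), so the resulting vectors differ from those above.

```python
postings = {
    "P1": {"g1", "g2", "g3", "g4"},
    "P2": {"g1", "g4"},
    "P3": {"g2", "g3", "g4", "g5"},
    "P4": {"g1", "g3"},
}
features = ["P1", "P2", "P3", "P4"]

def to_vector(g, features, postings):
    # x_i = 1 iff graph g supports feature p_i
    return [1 if g in postings[p] else 0 for p in features]

vectors = {g: to_vector(g, features, postings)
           for g in ["g1", "g2", "g3", "g4", "g5"]}
# g1 -> [1, 1, 0, 1], g5 -> [0, 0, 1, 0]
```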
Future work: graph classification. Previous work:
(1) Class-two algorithms [batch mode]
(2) Class-three algorithms [direct mining]:
Iterative mining: at each iteration, only one feature is selected.
Redundancy: a feature f with a very low objective value in iteration i may be enumerated again in later iterations.
(3) Heuristic-based feature mining: the search space of subgraph features is explored randomly; close-to-optimal features tend to be discovered quickly (much faster than exact search).
Why not use other descriptors instead of subgraphs as features?
Future work: graph classification Direction One: Descriptor Mining
Tentative solution: use information diffusion models to build a descriptor capturing local context and topology information.
Future work: graph classification. Direction One: Descriptor Mining
What is information diffusion? (the word-of-mouth effect) Different models share a general assumption: nodes are either active or inactive; active nodes may cause others to activate; active nodes never deactivate.
Linear Threshold Model [69]; Independent Cascade Model [70].
Build a descriptor:
(1) For each node v, activate only v and trigger the information-diffusion procedure.
(2) Collect all active nodes to build the descriptor.
Information Diffusion Models
Linear Threshold Model [69]: a node v is influenced by each neighbor w with weight b(v, w). Node v is activated once the total influence of its active neighbors reaches its threshold, Σ_w b(v, w) ≥ δv, where δv is a random variable in [0, 1].
Independent Cascade Model [70]: when node v is activated, it has one and only one chance to activate each inactive neighbor w; the activation attempt succeeds with probability pvw.
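A minimal simulation of the Independent Cascade model; the graph, edge probabilities, and function names are hypothetical, with deterministic probabilities so the outcome is fixed.

```python
import random

def independent_cascade(neighbors, prob, seed_node, rng):
    """Run one cascade from a single seed: each newly activated node v gets
    exactly one chance to activate each inactive neighbor w, succeeding
    with probability prob[(v, w)]."""
    active = {seed_node}
    frontier = [seed_node]
    while frontier:
        next_frontier = []
        for v in frontier:
            for w in neighbors[v]:
                if w not in active and rng.random() < prob[(v, w)]:
                    active.add(w)
                    next_frontier.append(w)
        frontier = next_frontier
    return active

# Tiny example: a path a - b - c with deterministic edge probabilities.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
prob = {("a", "b"): 1.0, ("b", "a"): 1.0, ("b", "c"): 0.0, ("c", "b"): 1.0}
active = independent_cascade(neighbors, prob, "a", random.Random(0))  # {'a', 'b'}
```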
Future work: graph classification. Build a descriptor:
(1) For each node v, activate only v and trigger the information diffusion.
(2) Collect all active nodes to build the descriptor.
(3) p1 denotes the probability that a node with label 'a' becomes active.
Des(v, l): the descriptor of node v with label l.
DesA probabilities: label a: p1 = 1; label b: p2 = .8; label c: p3 = .9; label d: p4 = .4; ...
DesB probabilities: label a: p1 = 0.5; label b: p2 = .4; label c: p3 = .3; label d: p4 = .7; ...
Future work: graph classification. Mining descriptors for graph classification: convert to the probabilistic itemset mining problem.
Future plan:
(1) Try different information-diffusion models to see their pros and cons in modeling the 'local' context and topology information.
(2) Explore discriminative probabilistic itemset-mining algorithms for (binary) classification.
Future work: graph classification. Direction Two: Irredundant Iterative Feature Mining (embedding-fading approach).
Future work: time line
Thanks
Questions?
Reference
For a complete list of references, please refer to
http://cxs02.ist.psu.edu:8090/Blog/Proposal.html
Backup Slides
[Backup figure: features f1, f2, database graphs g1, g2, and a query graph q, with node IDs and labels A–D]