Mohammad Hasan, Mohammed Zaki
RPI, Troy, NY
Consider the following problem from Medical Informatics:
[Figure: pipeline from tissue images (healthy, diseased, damaged) to cell graphs, to discriminatory subgraphs, to a classifier]
04/20/23
Mining Task
Dataset: 30 graphs; average vertex count 2154; average edge count 36945
Support threshold: 40%
Result: no result from gSpan or Gaston after a week of running on a 2 GHz dual-core PC with 4 GB of RAM, running Linux
Limitations of Existing Subgraph Mining Algorithms
They work only for small graphs
The most popular datasets in graph mining are chemical graphs, which are mostly tree-like; in the DTP dataset (the most popular), the average vertex count is 43 and the average edge count is 45
They perform a complete enumeration; for a large input graph, the output set is neither enumerable nor usable
They follow a fixed enumeration order; a partial run does not efficiently generate the interesting subgraphs
Our approach: avoid complete enumeration and sample a set of interesting subgraphs from the output set
Why Sample the Solution?
Observation 1:
Mining is only an exploratory step; mined patterns are generally used in a subsequent KD task
Not all frequent patterns are equally important for the task at hand
A large output set leads to an information-overload problem
Observation 2:
Traditional mining algorithms explore the output space in a fixed enumeration order
Good for generating non-duplicate candidate patterns
But subsequent patterns in that order are very similar
Complete enumeration is generally unnecessary
Sampling can change the enumeration order to draw interesting, non-redundant subgraphs with higher probability
Output Space
Traditionally, the frequent subgraphs for a given support threshold
Can also be augmented with other constraints, to find good patterns for the desired KD task
[Figure: an input graph database and the corresponding output space for FPM with support = 2]
Sampling from Output Space
Return a random pattern from the output set
The random pattern is obtained by sampling from a desired distribution
Define an interestingness function f : F → R+, where f(p) returns the score of pattern p
The desired sampling distribution is proportional to the interestingness score
If the output space has only 3 patterns with scores 2, 3, and 4, sampling should follow the distribution {2/9, 3/9, 4/9}
Efficiency consideration: enumerate as few auxiliary patterns as possible
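When the output set is small enough to enumerate, score-proportional sampling is just a weighted draw. A minimal Python sketch of the slide's {2/9, 3/9, 4/9} example (the pattern names are hypothetical; OSS exists precisely because the output space usually cannot be enumerated like this):

```python
import random
from collections import Counter

# Hypothetical patterns with interestingness scores 2, 3, 4 (as in the slide);
# the target distribution is {2/9, 3/9, 4/9}.
patterns = ["p1", "p2", "p3"]
scores = [2, 3, 4]

rng = random.Random(42)
draws = Counter(rng.choices(patterns, weights=scores, k=90000))
for p, s in zip(patterns, scores):
    print(p, round(draws[p] / 90000, 2))  # empirical frequency approaches s / 9
```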
How to Choose f?
Depends on the application's needs
For exploratory data analysis (EDA), every frequent pattern can have a uniform score
For top-k pattern mining, support values can be used as scores (support-biased sampling)
For the subgraph summarization task, only maximal graph patterns have a uniform non-zero score
For graph classification, discriminatory subgraphs should have high scores
Challenges
The output space cannot be instantiated
Complete statistics about the output space are not known
The target distribution is not known entirely
Output Space of Graph Mining
[Figure: output-space graphs g1, g2, g3, g4, g5, ... with scores s1, s2, s3, ..., sn]
We want: P(g_i) = s_i / Σ_j s_j
MCMC Sampling
Solution approach: perform a random walk in the output space
Represent the output space as a transition graph to allow local transitions
Edges of the transition graph are chosen based on structural similarity
Make sure the random walk is ergodic
POG (partial order graph) as the transition graph: in the POG, every pattern is connected to its sub-patterns (with one less edge) and all its super-patterns (with one more edge)
Algorithm
Define the transition graph (for instance, the POG)
Define an interestingness function that determines the desired sampling distribution
Perform a random walk on the transition graph
Compute the neighborhood locally
Compute the transition probability; using the interestingness score here makes the method generic
Return the currently visited pattern after k iterations
Local Computation of Output Space
[Figure: current pattern g0 with its super-patterns and sub-patterns; the transition probabilities p01, ..., p05, p00 out of g0 sum to 1]
Patterns that are not part of the output space are discarded during local neighborhood computation
Compute P to Achieve the Target Distribution
If π is the stationary distribution and P is the transition matrix, then in equilibrium π P = π
The main task is to choose P so that the desired stationary distribution is achieved
In fact, we compute only one row of P (local computation)
We want: π(g_i) = s_i / Σ_j s_j
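A minimal numerical check of the relation π P = π: building P the Metropolis-Hastings way (uniform proposal, acceptance min(1, s_j / s_i)) for hypothetical scores (2, 3, 4), power iteration converges to π proportional to the scores. The 3-state chain is made up for illustration.

```python
# Scores for three hypothetical patterns; target pi_i = s_i / sum(s).
s = [2.0, 3.0, 4.0]
n = len(s)

# Uniform proposal q(i, j) = 1/(n-1); MH acceptance min(1, s_j / s_i).
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            P[i][j] = (1.0 / (n - 1)) * min(1.0, s[j] / s[i])
    P[i][i] = 1.0 - sum(P[i])  # rejected proposals stay put

# Power iteration: repeatedly apply pi <- pi P until equilibrium.
pi = [1.0 / n] * n
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

print([round(x, 3) for x in pi])  # approaches [2/9, 3/9, 4/9]
```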
Use the Metropolis-Hastings (MH) Algorithm
1. Fix an arbitrary proposal distribution q beforehand
2. Find a neighbor j (to move to) by sampling from q
3. Compute the acceptance probability and accept the move with that probability
4. If accepted, move to j; otherwise, repeat from step 2
[Figure: from pattern 0, neighbors 1-5 are proposed with probabilities q01, ..., q05 (and self-loop q00); if 3 is selected, the move is accepted with probability min(1, (s3 · q30) / (s0 · q03))]
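The four steps above can be sketched on a toy transition graph. The neighbor lists and scores below are made up for illustration, and this sketch uses the standard MH convention of staying at the current pattern on rejection:

```python
import random

# Hypothetical 4-pattern transition graph and interestingness scores f.
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
score = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

def mh_step(i, rng):
    # Steps 1-2: propose a neighbor j uniformly, so q(i, j) = 1/deg(i).
    j = rng.choice(neighbors[i])
    # Step 3: acceptance min(1, (f_j q(j,i)) / (f_i q(i,j)))
    #       = min(1, (f_j deg(i)) / (f_i deg(j))).
    a = min(1.0, score[j] * len(neighbors[i]) / (score[i] * len(neighbors[j])))
    # Step 4: accept and move, or stay at i.
    return j if rng.random() < a else i

rng = random.Random(7)
visits = {v: 0 for v in neighbors}
state = 0
for _ in range(200000):
    state = mh_step(state, rng)
    visits[state] += 1

# Long-run visit fraction of v approaches score[v] / sum(score.values()).
print({v: round(c / 200000, 2) for v, c in visits.items()})
```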
Uniform Sampling of Frequent Patterns
Target distribution: 1/n, 1/n, ..., 1/n
How to achieve it? Use the uniform proposal distribution q(u, v) = 1/d_u; the acceptance probability is then
min(1, d_u / d_v)
where d_x is the degree of vertex x in the POG
Uniform Sampling: Transition Probability Matrix
[Figure: an example POG over itemsets (A, B, D, ...) with its transition probability matrix P]
Discriminatory Subgraph Sampling
Database graphs are labeled
Subgraphs may be used as features for supervised classification, or in a graph kernel
[Figure: database graphs G1, G2, G3 with labels +1, +1, -1; subgraph mining yields subgraphs g1, g2, g3, ..., which embed each Gi as a feature vector of counts or binary indicators]
Sampling in Proportion to a Discriminatory Score (f)
Interestingness score (feature quality): entropy, or delta score = |positive support - negative support|
Direct mining is difficult: score values (entropy, delta score) are neither monotone nor anti-monotone
For a pattern P and its child C, Score(P) can be larger or smaller than Score(C)
Discriminatory Subgraph Sampling
Use the Metropolis-Hastings algorithm: choose a neighbor uniformly as the proposal distribution, and compute the acceptance probability from the delta scores:
min(1, (Δ_j / Δ_i) · (d_i / d_j))
i.e., the ratio of the delta scores of j and i, times the ratio of the degrees of i and j
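A minimal sketch of the two quantities involved, assuming the uniform-over-neighbors proposal; the function names are illustrative, not from the paper:

```python
# Delta score used as the interestingness function f for discriminatory
# sampling: delta(g) = |sup+(g) - sup-(g)|.
def delta_score(pos_support, neg_support):
    return abs(pos_support - neg_support)

# MH acceptance for a move from pattern i to pattern j when neighbors are
# proposed uniformly (q(i, j) = 1/deg(i)):
#   min(1, (delta_j * deg_i) / (delta_i * deg_j)).
def acceptance(delta_i, deg_i, delta_j, deg_j):
    return min(1.0, (delta_j * deg_i) / (delta_i * deg_j))

print(delta_score(10, 3))        # 7
print(acceptance(2, 4, 4, 2))    # capped at 1.0
```

Moves toward more discriminatory neighbors (larger delta) are always accepted when the degree ratio does not penalize them; moves toward less discriminatory neighbors are accepted only occasionally, which is what biases the walk.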
Datasets

Name          # of Graphs         Avg. Vertex Count   Avg. Edge Count
DTP           1084                43                  45
Chess         3196                10.25               -
Mutagenicity  2401 (+), 1936 (-)  17                  18
PPI           3                   2154                81607
Cell-Graphs   30                  2184                36945
Result Evaluation Metrics
Sampling quality: our sampling distribution vs. the target sampling distribution
Median and standard deviation of the visit counts
How the sampling converges (convergence rate)
Variation distance: Δ(x, t) = (1/2) Σ_y |P^t(x, y) - π(y)|
Scalability test: experiments on large datasets
Quality of the sampled patterns
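The variation distance can be computed directly from the two distributions. The values below are illustrative, not from the paper:

```python
# Variation distance Delta = (1/2) * sum_y |P(y) - pi(y)| between an
# empirical sampling distribution P and the target pi.
def variation_distance(p, pi):
    return 0.5 * sum(abs(p[y] - pi[y]) for y in pi)

empirical = {"g1": 0.20, "g2": 0.35, "g3": 0.45}  # hypothetical visit fractions
target = {"g1": 0.25, "g2": 0.25, "g3": 0.50}     # hypothetical target pi

print(variation_distance(empirical, target))  # (0.05 + 0.10 + 0.05) / 2 ≈ 0.10
```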
Uniform Sampling Results: Experiment Setup
Run the sampling algorithm for a sufficient number of iterations and observe the visit-count distribution
For a dataset with n frequent patterns, we perform 200·n iterations

Result on the DTP chemical dataset:

                  Max Count  Min Count  Median  Std
Uniform Sampling  338        32         209     59.02
Ideal Sampling    -          -          200     14.11
Sampling Quality
Depends on the choice of proposal distribution
If the vertices of the POG have similar degree values, sampling quality is good
The earlier dataset has patterns with widely varying degree values
For a clique dataset, sampling quality is almost perfect

Result on the Chess (itemset) dataset (100·n iterations):

                  Max Count  Min Count  Median  Std
Uniform Sampling  156        6          100     13.64
Ideal Sampling    -          -          100     10
Discriminatory Sampling Results (Mutagenicity Dataset)
[Figure: distribution of the delta score among all frequent patterns]
[Figure: relation between sampling rate and delta score]
Discriminatory Sampling Results (cont.)

Sample No  Delta Score  Rank  % of POG Explored
1          404          132   5.7
2          644          21    11.0
3          707          10    10.8
4          725          4     8.9
5          280          595   2.8
6          725          4     8.9
7          627          27    3.3
8          709          9     7.7
9          721          5     9.1
10         725          4     8.9
Discriminatory Sampling Results (Cell Graphs)
Total graphs: 30; min-sup = 6
No complete graph mining algorithm could finish on this dataset in a week of running (on a 2 GHz machine with 4 GB of RAM)
[Chart: number of subgraphs with delta score > 9 found by a traditional algorithm vs. OSS]
Summary

                Existing Algorithms                Output Space Sampling
Walk            Depth-first or breadth-first walk  Random walk on the subgraph space
                on the subgraph space
Extension       Rightmost extension                Arbitrary extension
Completeness    Complete algorithm                 Sampling algorithm

Quality: sampling-quality guarantee
Scalability: visits only a small part of the search space
Non-redundancy: finds very dissimilar patterns by virtue of randomness
Genericity: in terms of pattern type and sampling objective
Future Work and Discussion
It is important to choose the proposal distribution wisely, to get better sampling
For large graphs, support counting is still a bottleneck
How to avoid the isomorphism checking entirely
How to effectively parallelize the support counting
How to make the random walk converge faster: the POG generally has a small spectral gap, so convergence is slow; this makes the algorithm costly (more steps are needed to find good samples)
Acceptance Probability Computation
For a move from i to j: min(1, (f(j) · q(j, i)) / (f(i) · q(i, j)))
where f is the interestingness value (which defines the desired distribution) and q is the proposal distribution
Support-Biased Sampling
We want: π(g_i) = s_i / Σ_j s_j, where s_i is the support of g_i
What proposal distribution should we choose? Split the mass between the super-pattern neighbors N_up(u) and sub-pattern neighbors N_down(u):
Q(u, v) = (1 - α) / |N_up(u)|    if v ∈ N_up(u)
Q(u, v) = α / |N_down(u)|        if v ∈ N_down(u)
with α = 1 if N_up(u) = ∅, and α = 0 if N_down(u) = ∅
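A sketch of one self-consistent reading of the proposal above (the original slide is garbled, so the assignment of α vs. 1 - α to the up and down neighbor sets is an assumption):

```python
# Support-biased proposal Q(u, .): mass 1 - alpha spread uniformly over the
# super-pattern neighbors N_up(u), mass alpha over the sub-pattern
# neighbors N_down(u).
def proposal(n_up, n_down, alpha):
    if not n_up:
        alpha = 1.0  # no super-patterns: all mass goes to N_down(u)
    if not n_down:
        alpha = 0.0  # no sub-patterns: all mass goes to N_up(u)
    q = {v: (1.0 - alpha) / len(n_up) for v in n_up}
    q.update({v: alpha / len(n_down) for v in n_down})
    return q

# With alpha = 1/3, two up-neighbors, and three down-neighbors, each
# up-neighbor gets (2/3)/2 = 1/3 and each down-neighbor gets (1/3)/3 = 1/9,
# which matches the q(v, u) = 1/9 in the slide's worked example.
q = proposal(["u1", "u2"], ["d1", "d2", "d3"], alpha=1 / 3)
print(q)
```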
Example of Support-Biased Sampling
Suppose s(u) = 2, s(v) = 3, α = 1/3, q(u, v) = 1/2, and q(v, u) = 1/(3 × 3) = 1/9
Acceptance probability: min(1, (s(v) · q(v, u)) / (s(u) · q(u, v))) = min(1, (3 × 1/9) / (2 × 1/2)) = 1/3
[Figure: the corresponding POG fragment over itemsets A, B, D]
Sampling Convergence
[Figure: sampling convergence]
Support-Biased Sampling
A scatter plot of visit count vs. support shows a positive correlation (correlation: 0.76)
Specific Sampling Examples and Their Uses
Uniform sampling of frequent patterns: to explore the frequent patterns, to set a proper minimum-support value, to perform approximate counting
Support-biased sampling: to find the top-k patterns in terms of support value
Discriminatory subgraph sampling: to find subgraphs that are good features for classification