Mohammad Hasan, Mohammed Zaki
RPI, Troy, NY
Consider the following problem from Medical Informatics:
[Figure: pipeline from tissue images (healthy, diseased, damaged) to cell graphs, to discriminatory subgraphs, to a classifier]
04/20/23
Mining Task
Dataset: 30 graphs; average vertex count 2154; average edge count 36945
Support threshold: 40%
Result: no result from gSpan or Gaston after a week of running on a 2 GHz dual-core PC with 4 GB of RAM, running Linux
Limitations of Existing Subgraph Mining Algorithms
They work only for small graphs
The most popular datasets in graph mining are chemical graphs, which are mostly tree-like; in the DTP dataset (the most popular), the average vertex count is 43 and the average edge count is 45
They perform a complete enumeration; for a large input graph, the output set is neither enumerable nor usable
They follow a fixed enumeration order; a partial run does not efficiently generate the interesting subgraphs
Our approach: avoid complete enumeration and sample a set of interesting subgraphs from the output set
Why Sample the Solution?
Observation 1:
Mining is only an exploratory step; mined patterns are generally used in a subsequent KD task
Not all frequent patterns are equally important for the task at hand
A large output set leads to an information-overload problem
Observation 2:
Traditional mining algorithms explore the output space in a fixed enumeration order
Good for generating non-duplicate candidate patterns
But subsequent patterns in that order are very similar
Complete enumeration is generally unnecessary
Sampling can change the enumeration order to draw interesting, non-redundant subgraphs with higher probability
Output Space
Traditionally, the frequent subgraphs for a given support threshold
Can also be augmented with other constraints, to find good patterns for the desired KD task
[Figure: an input graph database and the corresponding output space for FPM with support = 2]
Sampling from Output Space
Return a random pattern from the output set
The random pattern is obtained by sampling from a desired distribution
Define an interestingness function f : F → R+, where f(p) returns the score of pattern p
The desired sampling distribution is proportional to the interestingness score
If the output space has only 3 patterns with scores 2, 3, and 4, sampling should follow the distribution {2/9, 3/9, 4/9}
Efficiency consideration: enumerate as few auxiliary patterns as possible
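When the output set is small enough to enumerate, score-proportional sampling is just a weighted draw. A minimal Python sketch of the slide's {2/9, 3/9, 4/9} example (the pattern names are hypothetical; OSS exists precisely because the output space usually cannot be enumerated like this):

```python
import random
from collections import Counter

# Hypothetical patterns with interestingness scores 2, 3, 4 (as in the slide);
# the target distribution is {2/9, 3/9, 4/9}.
patterns = ["p1", "p2", "p3"]
scores = [2, 3, 4]

rng = random.Random(42)
draws = Counter(rng.choices(patterns, weights=scores, k=90000))
for p, s in zip(patterns, scores):
    print(p, round(draws[p] / 90000, 2))  # empirical frequency approaches s / 9
```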
How to Choose f?
Depends on the application's needs
For exploratory data analysis (EDA), every frequent pattern can have a uniform score
For top-k pattern mining, support values can be used as scores (support-biased sampling)
For the subgraph summarization task, only maximal graph patterns have a uniform non-zero score
For graph classification, discriminatory subgraphs should have high scores
Challenges
The output space cannot be instantiated
Complete statistics about the output space are not known
The target distribution is not known entirely
Output Space of Graph Mining
[Figure: output-space graphs g1, g2, g3, g4, g5, ... with scores s1, s2, s3, ..., sn]
We want: P(g_i) = s_i / Σ_j s_j
MCMC Sampling
Solution approach: perform a random walk in the output space
Represent the output space as a transition graph to allow local transitions
Edges of the transition graph are chosen based on structural similarity
Make sure the random walk is ergodic
POG (partial order graph) as the transition graph: in the POG, every pattern is connected to its sub-patterns (with one less edge) and all its super-patterns (with one more edge)
Algorithm
Define the transition graph (for instance, the POG)
Define an interestingness function that determines the desired sampling distribution
Perform a random walk on the transition graph
Compute the neighborhood locally
Compute the transition probability; using the interestingness score here makes the method generic
Return the currently visited pattern after k iterations
Local Computation of Output Space
[Figure: current pattern g0 with its super-patterns and sub-patterns; the transition probabilities p01, ..., p05, p00 out of g0 sum to 1]
Patterns that are not part of the output space are discarded during local neighborhood computation
Compute P to Achieve the Target Distribution
If π is the stationary distribution and P is the transition matrix, then in equilibrium π P = π
The main task is to choose P so that the desired stationary distribution is achieved
In fact, we compute only one row of P (local computation)
We want: π(g_i) = s_i / Σ_j s_j
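A minimal numerical check of the relation π P = π: building P the Metropolis-Hastings way (uniform proposal, acceptance min(1, s_j / s_i)) for hypothetical scores (2, 3, 4), power iteration converges to π proportional to the scores. The 3-state chain is made up for illustration.

```python
# Scores for three hypothetical patterns; target pi_i = s_i / sum(s).
s = [2.0, 3.0, 4.0]
n = len(s)

# Uniform proposal q(i, j) = 1/(n-1); MH acceptance min(1, s_j / s_i).
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            P[i][j] = (1.0 / (n - 1)) * min(1.0, s[j] / s[i])
    P[i][i] = 1.0 - sum(P[i])  # rejected proposals stay put

# Power iteration: repeatedly apply pi <- pi P until equilibrium.
pi = [1.0 / n] * n
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

print([round(x, 3) for x in pi])  # approaches [2/9, 3/9, 4/9]
```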
Use the Metropolis-Hastings (MH) Algorithm
1. Fix an arbitrary proposal distribution q beforehand
2. Find a neighbor j (to move to) by sampling from q
3. Compute the acceptance probability and accept the move with that probability
4. If accepted, move to j; otherwise, repeat from step 2
[Figure: from pattern 0, neighbors 1-5 are proposed with probabilities q01, ..., q05 (and self-loop q00); if 3 is selected, the move is accepted with probability min(1, (s3 · q30) / (s0 · q03))]
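The four steps above can be sketched on a toy transition graph. The neighbor lists and scores below are made up for illustration, and this sketch uses the standard MH convention of staying at the current pattern on rejection:

```python
import random

# Hypothetical 4-pattern transition graph and interestingness scores f.
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
score = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

def mh_step(i, rng):
    # Steps 1-2: propose a neighbor j uniformly, so q(i, j) = 1/deg(i).
    j = rng.choice(neighbors[i])
    # Step 3: acceptance min(1, (f_j q(j,i)) / (f_i q(i,j)))
    #       = min(1, (f_j deg(i)) / (f_i deg(j))).
    a = min(1.0, score[j] * len(neighbors[i]) / (score[i] * len(neighbors[j])))
    # Step 4: accept and move, or stay at i.
    return j if rng.random() < a else i

rng = random.Random(7)
visits = {v: 0 for v in neighbors}
state = 0
for _ in range(200000):
    state = mh_step(state, rng)
    visits[state] += 1

# Long-run visit fraction of v approaches score[v] / sum(score.values()).
print({v: round(c / 200000, 2) for v, c in visits.items()})
```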
Uniform Sampling of Frequent Patterns
Target distribution: 1/n, 1/n, ..., 1/n
How to achieve it? Use the uniform proposal distribution q(u, v) = 1/d_u; the acceptance probability is then
min(1, d_u / d_v)
where d_x is the degree of vertex x in the POG
Uniform Sampling: Transition Probability Matrix
[Figure: an example POG over itemsets (A, B, D, ...) with its transition probability matrix P]
Discriminatory Subgraph Sampling
Database graphs are labeled
Subgraphs may be used as features for supervised classification, or in a graph kernel
[Figure: database graphs G1, G2, G3 with labels +1, +1, -1; subgraph mining yields subgraphs g1, g2, g3, ..., which embed each Gi as a feature vector of counts or binary indicators]
Sampling in Proportion to a Discriminatory Score (f)
Interestingness score (feature quality): entropy, or delta score = |positive support - negative support|
Direct mining is difficult: score values (entropy, delta score) are neither monotone nor anti-monotone
For a pattern P and its child C, Score(P) can be larger or smaller than Score(C)
Discriminatory Subgraph Sampling
Use the Metropolis-Hastings algorithm: choose a neighbor uniformly as the proposal distribution, and compute the acceptance probability from the delta scores:
min(1, (Δ_j / Δ_i) · (d_i / d_j))
i.e., the ratio of the delta scores of j and i, times the ratio of the degrees of i and j
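A minimal sketch of the two quantities involved, assuming the uniform-over-neighbors proposal; the function names are illustrative, not from the paper:

```python
# Delta score used as the interestingness function f for discriminatory
# sampling: delta(g) = |sup+(g) - sup-(g)|.
def delta_score(pos_support, neg_support):
    return abs(pos_support - neg_support)

# MH acceptance for a move from pattern i to pattern j when neighbors are
# proposed uniformly (q(i, j) = 1/deg(i)):
#   min(1, (delta_j * deg_i) / (delta_i * deg_j)).
def acceptance(delta_i, deg_i, delta_j, deg_j):
    return min(1.0, (delta_j * deg_i) / (delta_i * deg_j))

print(delta_score(10, 3))        # 7
print(acceptance(2, 4, 4, 2))    # capped at 1.0
```

Moves toward more discriminatory neighbors (larger delta) are always accepted when the degree ratio does not penalize them; moves toward less discriminatory neighbors are accepted only occasionally, which is what biases the walk.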
Datasets

Name          # of Graphs         Avg. Vertex Count   Avg. Edge Count
DTP           1084                43                  45
Chess         3196                10.25               -
Mutagenicity  2401 (+), 1936 (-)  17                  18
PPI           3                   2154                81607
Cell-Graphs   30                  2184                36945
Result Evaluation Metrics
Sampling quality: our sampling distribution vs. the target sampling distribution
Median and standard deviation of the visit counts
How the sampling converges (convergence rate)
Variation distance: Δ(x, t) = (1/2) Σ_y |P^t(x, y) - π(y)|
Scalability test: experiments on large datasets
Quality of the sampled patterns
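The variation distance can be computed directly from the two distributions. The values below are illustrative, not from the paper:

```python
# Variation distance Delta = (1/2) * sum_y |P(y) - pi(y)| between an
# empirical sampling distribution P and the target pi.
def variation_distance(p, pi):
    return 0.5 * sum(abs(p[y] - pi[y]) for y in pi)

empirical = {"g1": 0.20, "g2": 0.35, "g3": 0.45}  # hypothetical visit fractions
target = {"g1": 0.25, "g2": 0.25, "g3": 0.50}     # hypothetical target pi

print(variation_distance(empirical, target))  # (0.05 + 0.10 + 0.05) / 2 ≈ 0.10
```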
Uniform Sampling Results: Experiment Setup
Run the sampling algorithm for a sufficient number of iterations and observe the visit-count distribution
For a dataset with n frequent patterns, we perform 200·n iterations

Result on the DTP chemical dataset:

                  Max Count  Min Count  Median  Std
Uniform Sampling  338        32         209     59.02
Ideal Sampling    -          -          200     14.11
Sampling Quality
Depends on the choice of proposal distribution
If the vertices of the POG have similar degree values, sampling quality is good
The earlier dataset has patterns with widely varying degree values
For a clique dataset, sampling quality is almost perfect

Result on the Chess (itemset) dataset (100·n iterations):

                  Max Count  Min Count  Median  Std
Uniform Sampling  156        6          100     13.64
Ideal Sampling    -          -          100     10
Discriminatory Sampling Results (Mutagenicity Dataset)
[Figure: distribution of the delta score among all frequent patterns]
[Figure: relation between sampling rate and delta score]
Discriminatory Sampling Results (cont.)

Sample No  Delta Score  Rank  % of POG Explored
1          404          132   5.7
2          644          21    11.0
3          707          10    10.8
4          725          4     8.9
5          280          595   2.8
6          725          4     8.9
7          627          27    3.3
8          709          9     7.7
9          721          5     9.1
10         725          4     8.9
Discriminatory Sampling Results (Cell Graphs)
Total graphs: 30; min-sup = 6
No complete graph mining algorithm could finish on this dataset in a week of running (on a 2 GHz machine with 4 GB of RAM)
[Chart: number of subgraphs with delta score > 9 found by a traditional algorithm vs. OSS]
Summary

                Existing Algorithms                Output Space Sampling
Walk            Depth-first or breadth-first walk  Random walk on the subgraph space
                on the subgraph space
Extension       Rightmost extension                Arbitrary extension
Completeness    Complete algorithm                 Sampling algorithm

Quality: sampling-quality guarantee
Scalability: visits only a small part of the search space
Non-redundancy: finds very dissimilar patterns by virtue of randomness
Genericity: in terms of pattern type and sampling objective
Future Work and Discussion
It is important to choose the proposal distribution wisely, to get better sampling
For large graphs, support counting is still a bottleneck
How to avoid the isomorphism checking entirely
How to effectively parallelize the support counting
How to make the random walk converge faster: the POG generally has a small spectral gap, so convergence is slow; this makes the algorithm costly (more steps are needed to find good samples)
Acceptance Probability Computation
For a move from i to j: min(1, (f(j) · q(j, i)) / (f(i) · q(i, j)))
where f is the interestingness value (which defines the desired distribution) and q is the proposal distribution
Support-Biased Sampling
We want: π(g_i) = s_i / Σ_j s_j, where s_i is the support of g_i
What proposal distribution should we choose? Split the mass between the super-pattern neighbors N_up(u) and sub-pattern neighbors N_down(u):
Q(u, v) = (1 - α) / |N_up(u)|    if v ∈ N_up(u)
Q(u, v) = α / |N_down(u)|        if v ∈ N_down(u)
with α = 1 if N_up(u) = ∅, and α = 0 if N_down(u) = ∅
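A sketch of one self-consistent reading of the proposal above (the original slide is garbled, so the assignment of α vs. 1 - α to the up and down neighbor sets is an assumption):

```python
# Support-biased proposal Q(u, .): mass 1 - alpha spread uniformly over the
# super-pattern neighbors N_up(u), mass alpha over the sub-pattern
# neighbors N_down(u).
def proposal(n_up, n_down, alpha):
    if not n_up:
        alpha = 1.0  # no super-patterns: all mass goes to N_down(u)
    if not n_down:
        alpha = 0.0  # no sub-patterns: all mass goes to N_up(u)
    q = {v: (1.0 - alpha) / len(n_up) for v in n_up}
    q.update({v: alpha / len(n_down) for v in n_down})
    return q

# With alpha = 1/3, two up-neighbors, and three down-neighbors, each
# up-neighbor gets (2/3)/2 = 1/3 and each down-neighbor gets (1/3)/3 = 1/9,
# which matches the q(v, u) = 1/9 in the slide's worked example.
q = proposal(["u1", "u2"], ["d1", "d2", "d3"], alpha=1 / 3)
print(q)
```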
Example of Support-Biased Sampling
Suppose s(u) = 2, s(v) = 3, α = 1/3, q(u, v) = 1/2, and q(v, u) = 1/(3 × 3) = 1/9
Acceptance probability: min(1, (s(v) · q(v, u)) / (s(u) · q(u, v))) = min(1, (3 × 1/9) / (2 × 1/2)) = 1/3
[Figure: the corresponding POG fragment over itemsets A, B, D]
Sampling Convergence
[Figure: sampling convergence]
Support-Biased Sampling
A scatter plot of visit count vs. support shows a positive correlation (correlation: 0.76)
Specific Sampling Examples and Their Uses
Uniform sampling of frequent patterns: to explore the frequent patterns, to set a proper minimum-support value, to perform approximate counting
Support-biased sampling: to find the top-k patterns in terms of support value
Discriminatory subgraph sampling: to find subgraphs that are good features for classification