Mining Scientific Data Sets Using Graphs

George Karypis
Department of Computer Science & Engineering
University of Minnesota

(Michihiro Kuramochi & Mukund Deshpande)

NGDM-02

Outline

Mining Scientific Data-sets: Opportunities & Challenges

Using Graphs and Mining them

Pattern Discovery in Graphs

Putting Patterns to Good Use

Going Forward


Data Mining In Scientific Domain

Data mining has emerged as a critical tool for knowledge discovery in large data sets.
  It has been extensively used to analyze business, financial, and textual data sets.

The success of these techniques has renewed interest in applying them to various scientific and engineering fields:
  Astronomy
  Life sciences
  Ecosystem modeling
  Fluid dynamics
  Structural mechanics
  …

Challenges in Scientific Data Mining

Most existing data mining algorithms assume that the data is represented via:
  Transactions (sets of items)
  Sequences of items or events
  Multi-dimensional vectors
  Time series

Scientific datasets with structures, layers, hierarchy, geometry, and arbitrary relations cannot be accurately modeled using such frameworks.
  e.g., numerical simulations, 3D protein structures, chemical compounds, etc.

Need algorithms that operate on scientific datasets in their native representation

How to Model Scientific Datasets?

There are two basic choices:
  Treat each dataset/application differently and develop custom representations/algorithms.
  Employ a new way of modeling such datasets and develop algorithms that span across different applications!

What should be the properties of this general modeling framework?
  Abstract compared with the original raw data.
  Yet powerful enough to capture the important characteristics.

Labeled directed/undirected topological/geometric graphs and hypergraphs

Modeling Data With Graphs…Going Beyond Transactions

Graphs are suitable for capturing arbitrary relations between the various elements.

Data Instance                         Graph Instance
Element                               Vertex
Element’s Attributes                  Vertex Label
Relation Between Two Elements         Edge
Type Of Relation                      Edge Label
Relation between a Set of Elements    Hyper Edge

Graphs provide enormous flexibility for modeling the underlying data, as they allow the modeler to decide what the elements should be and which types of relations to model.
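The element-to-vertex and relation-to-edge mapping on this slide can be sketched as a small data structure. This is only an illustration: the class name `LabeledGraph`, its methods, and the toy water molecule are assumptions made for the example, not something taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph: elements become vertices, relations become edges."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> vertex label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> edge label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u == v:
            raise ValueError("simple graphs have no self-loops")
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be added before the edge")
        # A frozenset key makes the edge undirected and rules out multiple edges.
        self.edge_labels[frozenset((u, v))] = label

    def neighbors(self, v):
        return sorted(w for e in self.edge_labels if v in e for w in e if w != v)


# Hypothetical data instance: a water molecule (atoms are elements, bonds are relations).
water = LabeledGraph()
water.add_vertex(0, "O")
water.add_vertex(1, "H")
water.add_vertex(2, "H")
water.add_edge(0, 1, "single")
water.add_edge(0, 2, "single")
print(water.neighbors(0))
```

The modeler's choices show up directly here: what counts as a vertex, which attributes become labels, and which relations are worth an edge.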

Example: Protein 3D Structure

PDB: 1MWP, N-Terminal Domain Of The Amyloid Precursor Protein (Alzheimer's disease amyloid A4 protein precursor)

[Figure: graph of the protein structure, with vertices labeled α or β and edges labeled Backbone or Contact.]

Example: Fluid Dynamics

Vertices: Vortices

Edges: Proximity

Graph Mining

Goal: Develop algorithms to mine and analyze graph data sets.
  Finding patterns in these graphs
  Finding groups of similar graphs (clustering)
  Building predictive models for the graphs (classification)

Applications:
  Structural motif discovery
  High-throughput screening
  Protein fold recognition
  VLSI reverse engineering
  A lot more …

Beyond Scientific Applications:
  Semantic web
  Mining relational profiles
  Behavioral modeling
  Intrusion detection
  Citation analysis
  …

Finding Frequent Patterns in Graphs

A pattern is a relation between an object’s elements that recurs again and again.
  Common structures in a family of chemical compounds or proteins.
  Similar arrangements of vortices at different “instances” of turbulent fluid flows.
  …

There are two general ways to formally define a pattern in the context of graphs:
  Arbitrary subgraphs (connected or not)
  Induced subgraphs (connected or not)

Frequent pattern discovery translates to frequent subgraph discovery…
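The difference between the two pattern definitions can be shown with a short sketch. It assumes the pattern's vertices are already identified with vertices of the graph, so no isomorphism search is involved; the function names and the toy triangle are illustrative only.

```python
def edge(u, v):
    """Undirected edge as a frozenset of its two endpoints."""
    return frozenset((u, v))


def is_subgraph(pattern_edges, graph_edges):
    """Arbitrary subgraph: every pattern edge is present in the graph."""
    return pattern_edges <= graph_edges


def is_induced_subgraph(pattern_vertices, pattern_edges, graph_edges):
    """Induced subgraph: the pattern keeps every graph edge among its vertices."""
    induced = {e for e in graph_edges if e <= pattern_vertices}
    return pattern_edges == induced


# Hypothetical example: a triangle a-b-c and the path a-b-c inside it.
triangle = {edge("a", "b"), edge("b", "c"), edge("a", "c")}
path = {edge("a", "b"), edge("b", "c")}

print(is_subgraph(path, triangle))                           # True
print(is_induced_subgraph({"a", "b", "c"}, path, triangle))  # False
```

The path is an arbitrary subgraph of the triangle, but not an induced one, because the triangle also has the edge a-c between the same three vertices.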

Finding Frequent Subgraphs: Input and Output

Input:
  Database of graph transactions.
  Undirected simple graphs (no loops, no multiple edges).
  Each graph transaction has labels associated with its vertices and edges.
  Transactions need not be connected.
  Minimum support threshold σ.

Output:
  Frequent subgraphs that satisfy the minimum support constraint.
  Each frequent subgraph is connected.

[Figure: input graph transactions and the output frequent connected subgraphs, with supports of 100%, 66%, and 66%.]
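To make the support definition concrete, here is a minimal sketch that counts the fraction of graph transactions containing a pattern. The containment test is a brute-force labeled subgraph isomorphism, usable only on tiny graphs and not the method used by FSG; the helper names and the three toy transactions are assumptions for illustration.

```python
from itertools import permutations


def graph(labels, edges):
    """Build (vertex labels, {frozenset({u, v}): edge label}) from plain tuples."""
    return labels, {frozenset(e): lab for e, lab in edges.items()}


def contains(g, pattern):
    """Brute force: try every injective mapping of pattern vertices to graph vertices."""
    g_labels, g_edges = g
    p_labels, p_edges = pattern
    for mapping in permutations(range(len(g_labels)), len(p_labels)):
        if any(g_labels[mapping[i]] != lab for i, lab in enumerate(p_labels)):
            continue  # vertex labels disagree
        if all(g_edges.get(frozenset((mapping[u], mapping[v]))) == lab
               for (u, v), lab in p_edges.items()):
            return True  # every pattern edge is present with the same label
    return False


def support(database, pattern):
    """Fraction of graph transactions that contain the pattern."""
    return sum(contains(g, pattern) for g in database) / len(database)


# Hypothetical database of three transactions (edge label "s" = single bond).
database = [
    graph(["C", "C", "O"], {(0, 1): "s", (1, 2): "s"}),
    graph(["C", "O"], {(0, 1): "s"}),
    graph(["C", "C"], {(0, 1): "s"}),
]
c_o = graph(["C", "O"], {(0, 1): "s"})
print(support(database, c_o))
```

Here the C-O pattern occurs in two of the three transactions, so it is frequent for any minimum support threshold σ up to about 66%.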

FSG Frequent Subgraph Discovery Algorithm

Follows an Apriori-style level-by-level approach and grows the patterns one edge at a time.

[Figure: level-by-level flow: single edges, double edges, 3-candidates, 3-frequent subgraphs, 4-candidates, 4-frequent subgraphs.]
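The level-by-level idea can be sketched under one strong simplifying assumption that is not part of FSG: every vertex label is unique within a graph. A graph is then just a set of edges between labels, and containment reduces to a subset test, so no isomorphism is needed. All names below are illustrative.

```python
from itertools import combinations


def edge(u, v):
    return frozenset((u, v))


def is_connected(edges):
    """True if the edges form a single connected component."""
    edges = list(edges)
    reached = set(edges[0])
    remaining = edges[1:]
    grew = True
    while grew and remaining:
        grew = False
        for e in list(remaining):
            if e & reached:  # shares a vertex with the part reached so far
                reached |= e
                remaining.remove(e)
                grew = True
    return not remaining


def frequent_subgraphs(database, min_support):
    """Return a list of levels; level k holds the frequent connected k-edge patterns."""
    min_count = min_support * len(database)

    def count(pattern):
        return sum(pattern <= g for g in database)

    single_edges = {frozenset([e]) for g in database for e in g}
    level = {p for p in single_edges if count(p) >= min_count}
    levels = []
    while level:
        levels.append(level)
        k = len(next(iter(level)))
        # Join: two k-edge patterns that share k-1 edges give a (k+1)-edge candidate.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        # Prune by downward closure, then count the surviving candidates.
        level = {
            c for c in candidates
            if is_connected(c)
            and all(c - {e} in level for e in c if is_connected(c - {e}))
            and count(c) >= min_count
        }
    return levels


# Hypothetical database: two copies of the path A-B-C-D and one path A-B-C.
db = [
    {edge("A", "B"), edge("B", "C"), edge("C", "D")},
    {edge("A", "B"), edge("B", "C")},
    {edge("A", "B"), edge("B", "C"), edge("C", "D")},
]
levels = frequent_subgraphs(db, 0.6)
print([len(level) for level in levels])
```

With repeated vertex labels, the join, the pruning test, and the counting step each require (sub)graph isomorphism, which is where the real cost of the approach lies.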

Computational Challenges

Simple operations become complicated & expensive when dealing with graphs…

Candidate generation
  To determine if we can join two candidates, we need to perform subgraph isomorphism to determine if they have a common subgraph.
  There is no obvious way to reduce the number of times that we generate the same subgraph.
  Need to perform graph isomorphism for redundancy checks.
  The joining of two frequent subgraphs can lead to multiple candidate subgraphs.

Candidate pruning
  To check the downward closure property, we need subgraph isomorphism.

Frequency counting
  Subgraph isomorphism for checking containment of a frequent subgraph.

Key to FSG’s computational efficiency:
  Uses an efficient algorithm to determine a canonical labeling of a graph and uses these “strings” to perform identity checks.
  Uses a sophisticated candidate generation algorithm that reduces the number of times each candidate is generated.
  Uses an augmented TID-list based approach to speed up frequency counting.
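A rough sketch of how transaction-ID (TID) lists can cut down frequency counting: a candidate can only occur in transactions that contain all of the frequent subgraphs it was built from, so the expensive containment test runs only on the intersection of their TID lists. The function names, the `contains` callback, and the toy data are assumptions for illustration; the slide does not spell out FSG's augmented variant.

```python
def candidate_tids(parent_tid_lists):
    """Transactions that could contain a candidate: those containing all its parents."""
    return set.intersection(*parent_tid_lists)


def count_support(candidate, parent_tid_lists, database, contains, min_count):
    """Run the containment test only on the intersected TID list."""
    tids = candidate_tids(parent_tid_lists)
    if len(tids) < min_count:
        return set()  # cannot be frequent, so skip every isomorphism test
    return {t for t in tids if contains(database[t], candidate)}


# Hypothetical data: transactions as sets of named edges, containment as a subset test.
database = {
    0: {"ab", "bc", "cd"},
    1: {"ab", "bc"},
    2: {"bc", "cd"},
    3: {"ab"},
}
tids_ab_bc = {0, 1}  # TID list of the frequent pattern {ab, bc}
tids_bc_cd = {0, 2}  # TID list of the frequent pattern {bc, cd}
candidate = {"ab", "bc", "cd"}


def subset_contains(g, p):
    return p <= g


print(count_support(candidate, [tids_ab_bc, tids_bc_cd], database, subset_contains, 1))
```

Only transaction 0 survives the intersection, so a single containment test replaces four.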

Candidate Generation Based On Core Detection

Multiple candidates for the same core!


Candidate Generation Based On Core Detection

Multiple cores between two (k-1)-subgraphs

[Figure: two (k-1)-subgraphs that share a first core and a second core.]

Canonical Labeling

[Figure: a six-vertex graph (v0 to v5, with vertex labels B, B, B, B, A, A) and its adjacency matrix written under two different vertex orderings, 5 4 3 2 1 0 (labels A A B B B B) and 0 5 4 2 1 3 (labels B A A B B B). The two orderings give two different codes for the same graph:]

  Label = “1 01 011 0001 00010”
  Label = “1 11 100 1000 01000”
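The idea behind canonical labeling can be sketched with a brute-force version: write a code for every vertex ordering and keep one fixed representative, so that isomorphic graphs always end up with the same string. The code format (vertex labels followed by the upper triangle of the adjacency matrix, column by column) and the choice of the lexicographically largest code are assumptions of this sketch. It tries all n! orderings, so it does not reproduce the efficient algorithm the talk refers to.

```python
from itertools import permutations


def code(vertex_labels, edges, order):
    """Vertex labels in the given order, then the upper triangle column by column."""
    labels = "".join(vertex_labels[v] for v in order)
    columns = [
        "".join("1" if frozenset((order[i], order[j])) in edges else "0"
                for i in range(j))
        for j in range(1, len(order))
    ]
    return labels + " " + " ".join(columns)


def canonical_label(vertex_labels, edges):
    """Brute force: the lexicographically largest code over all vertex orderings."""
    n = len(vertex_labels)
    return max(code(vertex_labels, edges, p) for p in permutations(range(n)))


def edge(u, v):
    return frozenset((u, v))


# Hypothetical example: the path A-B-B under two different vertex numberings.
path_1 = (["A", "B", "B"], {edge(0, 1), edge(1, 2)})
path_2 = (["B", "A", "B"], {edge(1, 0), edge(0, 2)})
# A different graph on the same labels: A in the middle of two B vertices.
star = (["A", "B", "B"], {edge(0, 1), edge(0, 2)})

print(canonical_label(*path_1) == canonical_label(*path_2))  # True
print(canonical_label(*path_1) == canonical_label(*star))    # False
```

Once every pattern carries such a string, identity checks between candidates become plain string comparisons instead of graph isomorphism tests.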

DTP Dataset (chemical compounds) (Random 100K transactions)

[Figure: running time [sec] (axis 0 to 1600) and number of patterns discovered (axis 0 to 10000) plotted against minimum support [%] (1 to 10).]

DTP Dataset

[Figure: running time [sec] (0 to 1600) plotted against number of transactions (0 to 100000).]

Topology Is Not Enough (Sometimes)

[Figure: chemical compound structure diagrams.]

Graphs arising from physical domains have a strong geometric nature.
  This geometry must be taken into account by the data-mining algorithms.

Geometric graphs.
  Vertices have physical 2D and 3D coordinates associated with them.

gFSG—Geometric Extension Of FSG

Same input and same output as FSG
  Finds frequent geometric connected subgraphs

Geometric version of (sub)graph isomorphism
  The mapping of vertices can be translation, rotation, and/or scaling invariant.
  The matching of coordinates can be inexact as long as they are within a tolerance radius of R.
  R-tolerant geometric isomorphism.

Use Of Geometry

Transformation-invariant signatures enable quick identity checks
  Normalized sum of distances from the center to each vertex
  A sorted list of edge angles

To compare signatures, use a certain threshold
  No canonical labeling

For subgraph isomorphism, the coordinates of vertices also work as strong constraints that narrow down the search space of combinations.
  Not only the vertex and edge labels but also the coordinates must now be matched.

R-tolerance makes the problem of finding all patterns extremely hard.
  Patterns that are 2R-isomorphic to each other will not be counted properly
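A minimal sketch of the first signature on this slide, the normalized sum of distances from the center to each vertex. The slide does not say how the sum is normalized; dividing by the largest center-to-vertex distance is an assumption made here so that the value is unchanged by translation, rotation, and scaling. The function names and the default threshold are illustrative.

```python
import math


def distance_signature(points):
    """Sum of center-to-vertex distances, divided by the largest such distance."""
    n = len(points)
    dim = len(points[0])
    center = [sum(p[d] for p in points) / n for d in range(dim)]
    dists = [math.dist(p, center) for p in points]
    longest = max(dists)
    if longest == 0.0:
        return 0.0  # all vertices coincide
    return sum(dists) / longest


def may_match(points_a, points_b, threshold=1e-6):
    """Cheap filter: signatures further apart than the threshold rule out a match."""
    return abs(distance_signature(points_a) - distance_signature(points_b)) <= threshold


# Hypothetical example: a triangle, and the same triangle rotated by 90 degrees,
# scaled by 3, and translated.
triangle = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
moved = [(5.0, -1.0), (5.0, 5.0), (-1.0, -1.0)]
collinear = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]

print(may_match(triangle, moved))      # True
print(may_match(triangle, collinear))  # False
```

Equal signatures are only a necessary condition, so pairs that pass this filter still need the full geometric isomorphism test.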

Performance of gFSG

[Figure: running time [sec] (0 to 2500) plotted against number of transactions (0 to 20000), with one curve with scaling and one without scaling.]

Different numbers of transactions randomly sampled from the DTP dataset
Average transaction size about 23
Minimum support 1.0%

Example

[Figure: a discovered pattern, shown alongside the structures of compounds NSC 4960, NSC 191370, NSC 40773, NSC 164863, and NSC 699181.]

Putting Patterns to Good Use…


Drug Development Cycle

Idea for drug target

Drug screening/rational drug design/direct synthesis

Small scale production

Laboratory and animal testing

Production for clinical trials

File IND

Graph Classification Approach

Graph Databases

1 Discover Frequent Sub-graphs

2 Select Discriminating Features

3 Transform Graphs in Feature Representation

4 Learn a Classification Model
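Steps 2 to 4 of this pipeline can be sketched as follows, taking the frequent subgraphs from step 1 as given. Several pieces are assumptions for illustration: the selection score (gap in support between two classes), the nearest-centroid learner standing in for whatever classifier the authors used, and the toy data, in which graphs are sets of named bonds and containment is a subset test.

```python
def select_discriminating(patterns, graphs, labels, contains, top_k):
    """Step 2: rank patterns by the gap in support between the two classes."""
    pos = [g for g, y in zip(graphs, labels) if y == 1]
    neg = [g for g, y in zip(graphs, labels) if y == 0]

    def gap(p):
        sup_pos = sum(contains(g, p) for g in pos) / len(pos)
        sup_neg = sum(contains(g, p) for g in neg) / len(neg)
        return abs(sup_pos - sup_neg)

    return sorted(patterns, key=gap, reverse=True)[:top_k]


def to_features(graph, features, contains):
    """Step 3: binary vector with one entry per selected pattern."""
    return [int(contains(graph, p)) for p in features]


def train_centroids(vectors, labels):
    """Step 4 (stand-in learner): mean feature vector per class."""
    centroids = {}
    for y in set(labels):
        rows = [v for v, lab in zip(vectors, labels) if lab == y]
        centroids[y] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, vector):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(vector, centroids[y])))


def subset_contains(g, p):
    return p <= g


# Hypothetical training set: class 1 = active, class 0 = inactive.
graphs = [{"C-O", "C-N"}, {"C-O", "C-C"}, {"C-C", "C-N"}, {"C-C"}]
labels = [1, 1, 0, 0]
patterns = [frozenset({"C-O"}), frozenset({"C-C"}), frozenset({"C-N"})]

features = select_discriminating(patterns, graphs, labels, subset_contains, top_k=2)
vectors = [to_features(g, features, subset_contains) for g in graphs]
model = train_centroids(vectors, labels)
print(predict(model, to_features({"C-O", "C-N"}, features, subset_contains)))
```

Any standard learner that accepts fixed-length vectors could replace the centroid model; the point of step 3 is that graphs become ordinary feature vectors.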

Chemical Compound Datasets

Predictive Toxicology Challenge (PTC)
  Predicting toxicity (carcinogenicity) of compounds.
  Bioassays on four kinds of rodents
  4 classification problems -- approx. 400 chemical compounds.

DTP AIDS Antiviral Screen (AIDS)
  Predicting anti-HIV activity of compounds.
  Assay to measure protection of human cells against HIV infection.
  3 classification problems -- approx. 40,000 chemical compounds.

Anthrax
  Predicting binding ability of compounds with the anthrax toxin.
  Expensive molecular dynamics simulation
  Collaboration with Dr. Frank Lebeda, USAMRIID
  Approx. 35,000 chemical compounds

Comparison with PTC and HIV

[Figure: two ROC plots (true positive rate vs. false positive rate, both 0 to 1). One panel: PTC (female mice). The other panel: HIV screening, with curves Aids-AM, Aids-AI, and Aids-CACI.]

Anthrax


Comparison of Topological & Geometric Features on Anthrax Dataset

[Figure: ROC plot (true positive rate vs. false positive rate, both 0 to 1) with one curve for topological features and one for geometric features.]

Most Discriminating Subgraphs

[Figure: the most discriminating subgraphs found (a) on the Toxicology (PTC) dataset, (b) on the AIDS dataset, and (c) on the Anthrax dataset.]

Moving Forwards

Graphs provide a powerful mechanism to represent relational and physical datasets.

Can be used as a quick prototyping tool to test out whether or not data-mining techniques can help a particular application area and problem.

Their benefits can be realized if there exists an extensive set of efficient and scalable algorithms to mine them…


Research on Pattern Discovery

Robust algorithms for mining 3D geometric graphs
  extensive applications in proteomics

Algorithms for finding approximate patterns
  allow for a limited number of changes
  there is always variation in the physical world

Algorithms to find patterns in single large graphs
  what is a pattern?
  what is its support?
  do we allow for overlap?

…

Research on Classification

Position specific models
  a substructure at the surface of the protein is in general more important than the same substructure being buried

Efficient graph-based kernel methods
  algebra of graphs?

…

Research on Clustering

Efficient methods to compute graph similarities
  spectral properties?
  graph edit distance?

Graph consensus representations

Multiple graph “alignments”

…

Graphs, graphs, and more graphs…

Graphs with multi-dimensional labels

Stream graphs
  phone-network connections

Hypergraphs
  compact representation of set relations

Benchmarks and real-life test cases!

Thank you!

http://www.cs.umn.edu/~karypis
