Mining Scientific Data Sets Using Graphs George Karypis Department of Computer Science & Engineering University of Minnesota (Michihiro Kuramochi & Mukund Deshpande)
Jan 05, 2016
NGDM-02
Outline
Mining Scientific Data-sets: Opportunities & Challenges
Using Graphs and Mining them
Pattern Discovery in Graphs
Putting Patterns to Good Use
Going Forward
Data Mining In Scientific Domain
Data mining has emerged as a critical tool for knowledge discovery in large data sets. It has been extensively used to analyze business, financial, and textual data sets.
The success of these techniques has renewed interest in applying them to various scientific and engineering fields:
  Astronomy
  Life sciences
  Ecosystem modeling
  Fluid dynamics
  Structural mechanics
  …
Challenges in Scientific Data Mining
Most existing data mining algorithms assume that the data is represented as one of:
  Transactions (sets of items)
  Sequences of items or events
  Multi-dimensional vectors
  Time series
Scientific datasets with structures, layers, hierarchy, geometry, and arbitrary relations cannot be accurately modeled using such frameworks.
  e.g., numerical simulations, 3D protein structures, chemical compounds, etc.
Need algorithms that operate on scientific datasets in their native representation
How to Model Scientific Datasets?
There are two basic choices:
  Treat each dataset/application differently and develop custom representations/algorithms.
  Employ a new way of modeling such datasets and develop algorithms that span across different applications!
What should be the properties of this general modeling framework?
  Abstract compared with the original raw data.
  Yet powerful enough to capture the important characteristics.
Labeled directed/undirected topological/geometric graphs and hypergraphs
Modeling Data With Graphs…Going Beyond Transactions
Graphs are suitable for capturing arbitrary relations between the various elements.
Data Instance                        Graph Instance
Element                              Vertex
Element's attributes                 Vertex label
Relation between two elements        Edge
Type of relation                     Edge label
Relation between a set of elements   Hyperedge
Graphs provide enormous flexibility for modeling the underlying data, as they allow the modeler to decide what the elements should be and which types of relations to model.
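A minimal sketch of this element-to-vertex and relation-to-edge mapping, assuming Python and a hypothetical `LabeledGraph` class that is not part of the original slides. The water molecule is only an illustrative choice of elements (atoms) and relations (bonds).

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected simple graph with labels on vertices and edges (illustrative)."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u == v:
            raise ValueError("simple graphs have no self-loops")
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("add both endpoints before adding the edge")
        # Using a frozenset as the key rules out multiple edges between u and v.
        self.edge_labels[frozenset((u, v))] = label

    def neighbors(self, v):
        return sorted(w for e in self.edge_labels if v in e for w in e if w != v)


# Hypothetical example: atoms are the elements, bonds are the relations.
water = LabeledGraph()
water.add_vertex(0, "O")
water.add_vertex(1, "H")
water.add_vertex(2, "H")
water.add_edge(0, 1, "single")
water.add_edge(0, 2, "single")
```

A different modeler could just as well pick functional groups as vertices and spatial proximity as edges; the representation stays the same.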
Example: Protein 3D Structure
PDB: 1MWP, N-Terminal Domain of the Amyloid Precursor Protein (Alzheimer's disease amyloid A4 protein precursor)
[Figure: protein structure drawn as a graph; vertices are labeled α or β, and edges are labeled Backbone or Contact.]
Example: Fluid Dynamics
Vertices: vortices
Edges: proximity
Graph Mining
Goal: Develop algorithms to mine and analyze graph data sets.
  Finding patterns in these graphs
  Finding groups of similar graphs (clustering)
  Building predictive models for the graphs (classification)
Applications
  Structural motif discovery
  High-throughput screening
  Protein fold recognition
  VLSI reverse engineering
  A lot more …
Beyond Scientific Applications
  Semantic web
  Mining relational profiles
  Behavioral modeling
  Intrusion detection
  Citation analysis
  …
Finding Frequent Patterns in Graphs
A pattern is a relation between an object's elements that recurs over and over again.
  Common structures in a family of chemical compounds or proteins.
  Similar arrangements of vortices at different “instances” of turbulent fluid flows.
  …
There are two general ways to formally define a pattern in the context of graphs:
  Arbitrary subgraphs (connected or not)
  Induced subgraphs (connected or not)
Frequent pattern discovery translates to frequent subgraph discovery…
Finding Frequent Subgraphs: Input and Output
Input
  Database of graph transactions.
  Each transaction is an undirected simple graph (no loops, no multiple edges).
  Each graph transaction has labels associated with its vertices and edges.
  Transactions need not be connected.
  Minimum support threshold σ.
Output
  Frequent subgraphs that satisfy the minimum support constraint.
  Each frequent subgraph is connected.
[Figure: input graph transactions and the output frequent connected subgraphs, with supports of 100%, 66%, and 66%.]
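As an illustration only, the support of a pattern can be computed by testing whether each transaction contains it. The sketch below uses brute-force subgraph isomorphism, which is exponential and only suitable for tiny graphs; it is not FSG's method. The names `make_graph`, `contains`, and `support`, and the toy database, are assumptions made for this example.

```python
from itertools import permutations


def make_graph(vertex_labels, edges):
    """Build (vertex id -> label, frozenset({u, v}) -> edge label)."""
    return (dict(enumerate(vertex_labels)),
            {frozenset((u, v)): lbl for u, v, lbl in edges})


def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a (not necessarily induced) subgraph."""
    g_vl, g_el = graph
    p_vl, p_el = pattern
    p_vertices = list(p_vl)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_vl, len(p_vertices)):
        m = dict(zip(p_vertices, image))
        if any(p_vl[v] != g_vl[m[v]] for v in p_vertices):
            continue  # vertex labels disagree
        if all(g_el.get(frozenset(m[x] for x in e)) == lbl
               for e, lbl in p_el.items()):
            return True  # every pattern edge maps to an edge with the same label
    return False


def support(database, pattern):
    """Fraction of graph transactions that contain the pattern."""
    return sum(contains(g, pattern) for g in database) / len(database)


# Toy database of three labeled path graphs (hypothetical data).
db = [
    make_graph("CCO", [(0, 1, "-"), (1, 2, "-")]),  # C-C-O
    make_graph("CON", [(0, 1, "-"), (1, 2, "-")]),  # C-O-N
    make_graph("CCN", [(0, 1, "-"), (1, 2, "-")]),  # C-C-N
]
c_o = make_graph("CO", [(0, 1, "-")])
c_c = make_graph("CC", [(0, 1, "-")])
sigma = 0.5  # minimum support threshold
frequent = [p for p in (c_o, c_c) if support(db, p) >= sigma]
```

Both single-edge patterns occur in two of the three transactions, so both pass a threshold of σ = 0.5.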
FSG Frequent Subgraph Discovery Algorithm
Follows an Apriori-style level-by-level approach and grows the patterns one edge at a time.
[Figure: level-by-level flow from single edges and double edges to 3-candidates, 3-frequent subgraphs, 4-candidates, and 4-frequent subgraphs.]
Computational Challenges
Simple operations become complicated & expensive when dealing with graphs…
Candidate generation
  To determine if we can join two candidates, we need to perform subgraph isomorphism to determine if they have a common subgraph.
  There is no obvious way to reduce the number of times that we generate the same subgraph.
  Need to perform graph isomorphism for redundancy checks.
  The joining of two frequent subgraphs can lead to multiple candidate subgraphs.
Candidate pruning
  To check the downward closure property, we need subgraph isomorphism.
Frequency counting
  Subgraph isomorphism for checking containment of a frequent subgraph.
Key to FSG's computational efficiency:
  Uses an efficient algorithm to determine a canonical labeling of a graph and uses these “strings” to perform identity checks.
  Uses a sophisticated candidate generation algorithm that reduces the number of times each candidate is generated.
  Uses an augmented TID-list based approach to speed up frequency counting.
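One way to picture the TID-list idea, as a hedged sketch rather than FSG's actual implementation: by downward closure, a candidate can only occur in transactions that contain all of its frequent sub-patterns, so intersecting their transaction-id lists bounds its support before any isomorphism test runs. The function name `candidate_tids` and the sample lists are hypothetical, and what makes FSG's lists "augmented" is not covered here.

```python
def candidate_tids(parent_tid_lists, min_count):
    """Intersect the TID lists of a candidate's frequent sub-patterns.

    Returns the transaction ids that still need a subgraph-isomorphism
    check, or None when the candidate cannot reach `min_count` and can
    be pruned without any isomorphism test.
    """
    tids = set.intersection(*map(set, parent_tid_lists))
    return tids if len(tids) >= min_count else None


# Hypothetical TID lists of three frequent sub-patterns of one candidate.
survivors = candidate_tids([{0, 1, 2, 5}, {1, 2, 5, 7}, {2, 5, 9}], min_count=2)
pruned = candidate_tids([{0, 1}, {1, 2}], min_count=2)
```

In the first call only transactions 2 and 5 need the expensive containment test; in the second the candidate is discarded outright.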
Candidate Generation Based On Core Detection
Multiple candidates for the same core!
Multiple cores between two (k-1)-subgraphs
[Figure: two (k-1)-subgraphs sharing a first core and a second core.]
Canonical Labeling
[Figure: adjacency matrices of one six-vertex graph (v0 to v3 labeled B, v4 and v5 labeled A) under two vertex orderings: 5 4 3 2 1 0 (labels A A B B B B) and 0 5 4 2 1 3 (labels B A A B B B).]
Label = “1 01 011 0001 00010”
Label = “1 11 100 1000 01000”
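The idea of a canonical label can be sketched by brute force: write the graph as a string for every vertex ordering and keep one extreme string, so that isomorphic graphs always receive the same label. The sketch assumes single-character vertex and edge labels and picks the lexicographically smallest code; the ordering convention and the efficient algorithm used by FSG are not given in the slides, and `canonical_label` is a hypothetical name.

```python
from itertools import permutations


def canonical_label(vertex_labels, edges):
    """Brute-force canonical label of a small labeled graph.

    vertex_labels: list of one-character labels, index = vertex id
    edges: dict frozenset({u, v}) -> one-character edge label
    """
    n = len(vertex_labels)
    best = None
    for order in permutations(range(n)):
        vertex_part = "".join(vertex_labels[v] for v in order)
        # Upper-triangular adjacency entries, column by column; "0" means no edge.
        edge_part = " ".join(
            "".join(edges.get(frozenset((order[i], order[j])), "0") for i in range(j))
            for j in range(1, n)
        )
        code = (vertex_part, edge_part)
        if best is None or code < best:
            best = code
    return best


# The same path A-B-B written with two different vertex numberings,
# plus a non-isomorphic path B-A-B (hypothetical data).
g1 = (["A", "B", "B"], {frozenset((0, 1)): "1", frozenset((1, 2)): "1"})
g2 = (["B", "A", "B"], {frozenset((1, 0)): "1", frozenset((0, 2)): "1"})
g3 = (["B", "A", "B"], {frozenset((0, 1)): "1", frozenset((1, 2)): "1"})
same = canonical_label(*g1) == canonical_label(*g2)
different = canonical_label(*g1) != canonical_label(*g3)
```

With such labels, an identity check between two candidate subgraphs reduces to a string comparison. Enumerating all orderings is factorial in the number of vertices, which is why an efficient labeling algorithm matters.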
DTP Dataset (chemical compounds) (Random 100K transactions)
[Figure: running time (sec) and number of patterns discovered, plotted against minimum support (%) from 1 to 10.]
DTP Dataset
[Figure: running time (sec) versus number of transactions, from 0 to 100,000.]
Topology Is Not Enough (Sometimes)
[Figure: chemical compound structures.]
Graphs arising from physical domains have a strong geometric nature. This geometry must be taken into account by the data-mining algorithms.
Geometric graphs: vertices have physical 2D and 3D coordinates associated with them.
gFSG—Geometric Extension Of FSG
Same input and same output as FSG
  Finds frequent geometric connected subgraphs
Geometric version of (sub)graph isomorphism
  The mapping of vertices can be translation, rotation, and/or scaling invariant.
  The matching of coordinates can be inexact as long as they are within a tolerance radius of r.
  r-tolerant geometric isomorphism.
[Figure: geometric graph with vertices A, B, and C.]
Use Of Geometry
Transformation-invariant signatures enable quick identity checks
  Normalized sum of distances from the center to each vertex
  A sorted list of edge angles
To compare signatures, use a certain threshold
  No canonical labeling
For subgraph isomorphism, the coordinates of vertices also act as strong constraints that narrow down the search space of combinations.
  Not only the vertex and edge labels, but now also the coordinates must be matched.
r-tolerance makes the problem of finding all patterns extremely hard.
  Patterns that are 2r-isomorphic to each other will not be counted properly
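A hedged sketch of the first signature, the normalized sum of distances from the center to each vertex. The slides do not say how the sum is normalized; dividing by the largest center-to-vertex distance is an assumption made here so that the value is unchanged by translation, rotation, and uniform scaling. The names `distance_signature` and `signatures_match` and the default threshold are hypothetical.

```python
import math


def distance_signature(points):
    """Sum of centroid-to-vertex distances divided by the largest such distance.

    points: list of equal-length coordinate tuples (2D or 3D).
    The normalization is an assumption, not taken from the slides.
    """
    n = len(points)
    dim = len(points[0])
    center = [sum(p[k] for p in points) / n for k in range(dim)]
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, center))) for p in points]
    longest = max(dists)
    return 0.0 if longest == 0.0 else sum(dists) / longest


def signatures_match(a, b, threshold=1e-6):
    """Signatures are compared within a threshold, not for exact equality."""
    return abs(a - b) <= threshold


triangle = [(0, 0), (4, 0), (0, 3)]
# Same triangle rotated by 90 degrees, scaled by 2, and translated.
moved = [(10, -5), (10, 3), (4, -5)]
other = [(0, 0), (1, 0), (0, 1)]
match = signatures_match(distance_signature(triangle), distance_signature(moved), 1e-9)
mismatch = signatures_match(distance_signature(triangle), distance_signature(other))
```

Equal signatures do not prove that two geometric graphs are isomorphic; a signature only serves as a quick filter before the full coordinate-aware matching.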
Performance of gFSG
[Figure: running time (sec) versus number of transactions (0 to 20,000), with scaling and without scaling.]
Different numbers of transactions randomly sampled from the DTP dataset
Average transaction size about 23
Minimum support 1.0%
Example
[Figure: a discovered pattern and the compounds NSC 4960, NSC 191370, NSC 40773, NSC 164863, and NSC 699181.]
Putting Patterns to Good Use…
Drug Development Cycle
Idea for drug target
Drug screening/rational drug design/direct synthesis
Small scale production
Laboratory and animal testing
Production for clinical trials
File IND
Graph Classification Approach
Graph Databases
1. Discover Frequent Sub-graphs
2. Select Discriminating Features
3. Transform Graphs in Feature Representation
4. Learn a Classification Model
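The feature selection and transformation steps can be sketched as follows, assuming the frequent subgraphs have already been discovered and that, for each graph, the set of patterns it contains is known. The selection criterion used here (gap in support between the two classes) is an illustrative assumption; the slides do not state which criterion is used. All names and data are hypothetical.

```python
def select_discriminating(occurrences, labels, k):
    """Keep the k patterns whose support differs most between the two classes.

    occurrences: list of sets, one per graph, holding the ids of contained patterns
    labels: list of 0/1 class labels, one per graph (both classes must be present)
    """
    pos = [o for o, y in zip(occurrences, labels) if y == 1]
    neg = [o for o, y in zip(occurrences, labels) if y == 0]
    patterns = sorted(set().union(*occurrences))

    def gap(p):
        return abs(sum(p in o for o in pos) / len(pos)
                   - sum(p in o for o in neg) / len(neg))

    return sorted(patterns, key=gap, reverse=True)[:k]


def to_feature_vectors(occurrences, features):
    """One binary vector per graph: 1 if the graph contains the pattern, else 0."""
    return [[1 if f in occ else 0 for f in features] for occ in occurrences]


# Hypothetical output of frequent subgraph discovery on four graphs.
occurrences = [{"p1", "p2"}, {"p1", "p3"}, {"p2"}, {"p2", "p3"}]
labels = [1, 1, 0, 0]
features = select_discriminating(occurrences, labels, k=2)
vectors = to_feature_vectors(occurrences, features)
```

The resulting vectors can be handed to any standard vector-based classifier for the final learning step.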
Chemical Compound Datasets
Predictive Toxicology Challenge (PTC)
  Predicting toxicity (carcinogenicity) of compounds.
  Bioassays on four kinds of rodents.
  4 classification problems, approx. 400 chemical compounds.
DTP AIDS Antiviral Screen (AIDS)
  Predicting anti-HIV activity of compounds.
  Assay to measure protection of human cells against HIV infection.
  3 classification problems, approx. 40,000 chemical compounds.
Anthrax
  Predicting binding ability of compounds with the anthrax toxin.
  Expensive molecular dynamics simulation.
  Collaboration with Dr. Frank Lebeda, USAMRIID.
  Approx. 35,000 chemical compounds.
Comparison with PTC and HIV
[Figure: ROC curve (true positive rate versus false positive rate) for PTC, female mice.]
[Figure: ROC curves (true positive rate versus false positive rate) for HIV screening: Aids-AM, Aids-AI, Aids-CACI.]
Anthrax
Comparison of Topological & Geometric Features on Anthrax Dataset
[Figure: ROC curves (true positive rate versus false positive rate) for topological and geometric features.]
Most Discriminating Subgraphs
[Figure: most discriminating subgraphs (a) on the Toxicology (PTC) dataset, (b) on the AIDS dataset, (c) on the Anthrax dataset.]
Moving Forwards
Graphs provide a powerful mechanism to represent relational and physical datasets.
Can be used as a quick prototyping tool to test out whether or not data-mining techniques can help a particular application area and problem.
Their benefits can be realized if there exists an extensive set of efficient and scalable algorithms to mine them…
Research on Pattern Discovery
Robust algorithms for mining 3D geometric graphs
  extensive applications in proteomics
Algorithms for finding approximate patterns
  allow for a limited number of changes
  there is always variation in the physical world
Algorithms to find patterns in single large graphs
  what is a pattern?
  what is its support?
  do we allow for overlap?
…
Research on Classification
Position-specific models
  a substructure at the surface of the protein is in general more important than the same substructure being buried
Efficient graph-based kernel methods
  algebra of graphs?
…
Research on Clustering
Efficient methods to compute graph similarities
  spectral properties?
  graph edit distance?
Graph consensus representations
Multiple graph “alignments”
…
Graphs, graphs, and more graphs…
Graphs with multi-dimensional labels
Stream graphs
  phone-network connections
Hypergraphs
  compact representation of set relations
Benchmarks and real-life test cases!
Thank you!
http://www.cs.umn.edu/~karypis