Chen Chen (1), Cindy X. Lin (1), Matt Fredrikson (2), Mihai Christodorescu (3), Xifeng Yan (4), Jiawei Han (1)
(1) University of Illinois at Urbana-Champaign
(2) University of Wisconsin at Madison
(3) IBM T. J. Watson Research Center
(4) University of California at Santa Barbara
Outline
  Motivation
    The efficiency bottleneck encountered in big networks
    Patterns must be preserved
  Summarize-Mine
  Experiments
  Summary
Frequent Subgraph Mining
  Find all graphs p such that |D_p| >= min_sup, where D_p is the set of graphs in the database D that contain p
  Get into the topological structures of graph data
  Useful for many downstream applications
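The support condition above can be made concrete with a small sketch. It assumes node-labeled, undirected graphs stored as a pair (labels, edges) and uses a brute-force containment test that is only practical for tiny graphs; the names `make_graph`, `contains`, `support`, and the toy database are illustrative and not taken from the slides.

```python
from itertools import permutations

def make_graph(labels, edges):
    """Hypothetical representation: (node -> label dict, set of undirected edges)."""
    return dict(labels), {frozenset(e) for e in edges}

def contains(graph, pattern):
    """Brute-force test: does `graph` contain a subgraph matching `pattern`?"""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern nodes to graph nodes.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[u] != g_labels[mapping[u]] for u in p_nodes):
            continue  # labels must agree
        if all(frozenset(mapping[u] for u in e) in g_edges for e in p_edges):
            return True  # every pattern edge is present in the graph
    return False

def support(database, pattern):
    """|D_p|: the number of database graphs that contain the pattern."""
    return sum(1 for g in database if contains(g, pattern))

# Toy database D with three small graphs.
G1 = make_graph({1: "a", 2: "b", 3: "c"}, [(1, 2), (2, 3)])
G2 = make_graph({1: "a", 2: "b"}, [(1, 2)])
G3 = make_graph({1: "a", 2: "c"}, [(1, 2)])
D = [G1, G2, G3]

p_ab = make_graph({"x": "a", "y": "b"}, [("x", "y")])
p_ac = make_graph({"x": "a", "y": "c"}, [("x", "y")])

min_sup = 2
frequent_ab = support(D, p_ab) >= min_sup
frequent_ac = support(D, p_ac) >= min_sup
```

Here the edge pattern a-b occurs in G1 and G2, so it meets min_sup = 2, while a-c occurs only in G3. The exhaustive search over node assignments is exactly the part that becomes infeasible on big networks.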
Challenges
  Subgraph isomorphism checking is inevitable for any frequent subgraph mining algorithm
  This will have problems on big networks
    Suppose there is only one triangle in the network
    But there are 1,000,000 length-2 paths
    We must enumerate all these 1,000,000, because any one of them has the potential to grow into a full triangle
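To get a feel for the numbers in this example, the sketch below builds one hypothetical network with this shape: a hub connected to many leaves, plus a single extra edge between two leaves. The construction, the function names, and the choice of 1415 leaves are assumptions made only to land near one million length-2 paths.

```python
from math import comb

def star_with_one_triangle(n_leaves):
    """Hub node 0 joined to leaves 1..n_leaves, plus one extra edge 1-2."""
    adj = {0: set()}
    for leaf in range(1, n_leaves + 1):
        adj[0].add(leaf)
        adj[leaf] = {0}
    adj[1].add(2)
    adj[2].add(1)
    return adj

def count_length2_paths(adj):
    """Each length-2 path is a center node plus an unordered pair of its neighbors."""
    return sum(comb(len(nbrs), 2) for nbrs in adj.values())

def count_triangles(adj):
    """Count each triangle once by requiring u < v < w."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total

network = star_with_one_triangle(1415)
n_paths = count_length2_paths(network)
n_triangles = count_triangles(network)
```

With 1415 leaves the network holds roughly a million length-2 paths but a single triangle, so a miner that grows patterns edge by edge has to touch all of those paths to find the one that closes.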
Too Many Embeddings
  Subgraph isomorphism is NP-hard
    So, when the problem size increases, …
  During the checking, large graphs are grown from small subparts
    For small subparts, there might be too many (overlapped) embeddings in a big network
    Such embedding enumerations will finally kill us
Motivating Application
  System call graphs from security research
    Model dependencies among system calls
    Unique subgraph signatures for malicious programs
    Compare malicious/benign programs
  These graphs are very big
    Thousands of nodes on average
    We tried state-of-the-art mining technologies, but failed
Our Approach
  Subgraph isomorphism checking cannot be done on large networks
    So we do it on small graphs
  Summarize-Mine
    Summarize: Merge nodes by label and collapse corresponding edges
    Mine: Now, state-of-the-art algorithms should work
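The Summarize step can be sketched in a few lines. This is a deliberately simplified reading of "merge nodes by label": every node sharing a label is merged into one summary node, parallel edges collapse into one, and edges between two nodes of the same label are dropped. The function name `summarize`, the graph encoding, and those two choices are assumptions; the actual method may group nodes differently.

```python
def summarize(labels, edges):
    """Merge all nodes with the same label and collapse the resulting edges.

    labels: dict mapping node id -> label
    edges:  iterable of (u, v) undirected edges
    Returns (set of summary nodes, set of summary edges).
    """
    summary_nodes = set(labels.values())
    summary_edges = set()
    for u, v in edges:
        lu, lv = labels[u], labels[v]
        if lu != lv:  # assumption: edges inside a merged group are dropped
            summary_edges.add(frozenset((lu, lv)))  # parallel edges collapse here
    return summary_nodes, summary_edges

# A small original graph with five nodes and three labels.
labels = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c"}
edges = [(1, 3), (2, 4), (3, 5), (1, 2)]

nodes_s, edges_s = summarize(labels, edges)
```

In this toy case five nodes and four edges shrink to three nodes and two edges (a-b and b-c), which is the kind of reduction that lets an existing miner run on the summary instead of the original graph.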
Mining after Summarization
  [Figure: the original graphs G1, G2, … (nodes labeled a, b, c) are summarized into smaller graphs g1, g2, …; mining is run on the summaries to produce the output patterns]
Remedy for Pattern Changes
  Frequent subgraphs are presented on a different abstraction level
  False negatives & false positives, compared to true patterns mined from the un-summarized database D
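One way a false positive can arise is shown below, under the simplifying assumption that all nodes with the same label are merged into one. The helper names `merge_by_label` and `has_path` and the four-node graph are illustrative only. In the original graph no single b node touches both an a and a c, yet the summary contains the path a-b-c.

```python
def merge_by_label(labels, edges):
    """Assumed summarization: one summary node per label, edges collapsed."""
    merged_labels = {lab: lab for lab in labels.values()}
    merged_edges = {
        frozenset((labels[u], labels[v]))
        for u, v in edges
        if labels[u] != labels[v]
    }
    return merged_labels, merged_edges

def has_path(labels, edges, l1, l2, l3):
    """True if some node labeled l2 has one neighbor labeled l1 and another labeled l3."""
    adj = {n: set() for n in labels}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    for mid, nbrs in adj.items():
        if labels[mid] != l2:
            continue
        ends1 = {n for n in nbrs if labels[n] == l1}
        ends3 = {n for n in nbrs if labels[n] == l3}
        if any(x != y for x in ends1 for y in ends3):
            return True
    return False

# Original graph: a-b and b-c exist, but on two different b nodes.
orig_labels = {1: "a", 2: "b", 3: "b", 4: "c"}
orig_edges = [(1, 2), (3, 4)]

sum_labels, sum_edges = merge_by_label(orig_labels, orig_edges)

in_original = has_path(orig_labels, orig_edges, "a", "b", "c")
in_summary = has_path(sum_labels, sum_edges, "a", "b", "c")
```

The path is absent from the original graph and present in its summary, so a miner run on summaries alone would report it. This is why patterns found on summaries have to be checked against the un-summarized database D.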