Chen Chen 1, Cindy X. Lin 1, Matt Fredrikson 2, Mihai Christodorescu 3, Xifeng Yan 4, Jiawei Han 1
1 University of Illinois at Urbana-Champaign
2 University of Wisconsin at Madison
3 IBM T. J. Watson Research Center
4 University of California at Santa Barbara

Dec 19, 2015


Outline

Motivation
    The efficiency bottleneck encountered in big networks
    Patterns must be preserved
Summarize-Mine
Experiments
Summary


Frequent Subgraph Mining

Find all graphs p such that |Dp| >= min_sup
Get into the topological structures of graph data
Useful for many downstream applications


Challenges

Subgraph isomorphism checking is inevitable for any frequent subgraph mining algorithm
This will have problems on big networks
    Suppose there is only one triangle in the network
    But there are 1,000,000 length-2 paths
    We must enumerate all these 1,000,000, because any one of them has the potential to grow into a full triangle
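
The imbalance between small and large subparts can be illustrated with a toy network. The sketch below is only an illustration: the star-plus-one-edge graph and the function names are hypothetical and not from the slides. It builds a network with a single triangle and counts its length-2 paths.

```python
from itertools import combinations

def count_two_paths(adj):
    """Number of length-2 paths: pick 2 neighbours around each centre vertex."""
    return sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())

def count_triangles(adj):
    """Count each triangle once by requiring u < v < w."""
    return sum(
        1
        for u in adj
        for v, w in combinations(sorted(n for n in adj[u] if n > u), 2)
        if w in adj[v]
    )

def star_with_one_triangle(leaves):
    """Hypothetical toy network: hub 0 joined to `leaves` vertices, plus edge 1-2."""
    adj = {v: set() for v in range(leaves + 1)}
    for v in range(1, leaves + 1):
        adj[0].add(v)
        adj[v].add(0)
    adj[1].add(2)
    adj[2].add(1)
    return adj

adj = star_with_one_triangle(1415)
print(count_two_paths(adj), count_triangles(adj))
```

With 1,415 leaves the hub alone contributes just over a million length-2 paths, yet only the path through the extra edge closes into a triangle. A miner that grows patterns edge by edge still has to visit every one of those paths.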


Too Many Embeddings

Subgraph isomorphism is NP-hard
    So, when the problem size increases, …
During the checking, large graphs are grown from small subparts
    For small subparts, there might be too many (overlapped) embeddings in a big network
    Such embedding enumerations will finally kill us


Motivating Application

System call graphs from security research
    Model dependencies among system calls
    Unique subgraph signatures for malicious programs
    Compare malicious/benign programs
These graphs are very big
    Thousands of nodes on average
    We tried state-of-the-art mining technologies, but failed


Our Approach

Subgraph isomorphism checking cannot be done on large networks
    So we do it on small graphs
Summarize-Mine
    Summarize: Merge nodes by label and collapse corresponding edges
    Mine: Now, state-of-the-art algorithms should work
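
A minimal sketch of the Summarize step, assuming an undirected labeled graph stored as a label dictionary plus an edge list. The function name `summarize`, the `groups_per_label` parameter, and the choice to drop self-loops created by merging are assumptions made for illustration; the slides do not specify these details.

```python
import random

def summarize(labels, edges, groups_per_label, rng):
    """Randomly merge same-label vertices into groups and collapse edges.

    labels: dict vertex -> label
    edges: iterable of undirected (u, v) pairs
    groups_per_label: dict label -> number of summary nodes for that label
    Returns (summary_labels, summary_edges, assignment).
    """
    # Each vertex goes to one of the groups reserved for its own label.
    assignment = {
        v: (lab, rng.randrange(groups_per_label[lab])) for v, lab in labels.items()
    }
    summary_labels = {node: node[0] for node in assignment.values()}
    summary_edges = set()
    for u, v in edges:
        a, b = assignment[u], assignment[v]
        if a != b:  # assumption: self-loops produced by merging are dropped
            summary_edges.add((min(a, b), max(a, b)))  # parallel edges collapse
    return summary_labels, summary_edges, assignment

labels = {0: "a", 1: "a", 2: "b", 3: "c"}
edges = [(0, 2), (1, 2), (2, 3)]
s_labels, s_edges, _ = summarize(labels, edges, {"a": 1, "b": 1, "c": 1}, random.Random(0))
print(len(s_labels), len(s_edges))
```

With one group per label, the two `a` vertices merge and their edges to `b` collapse into one, so the summary is smaller than the original graph while keeping its label-level connectivity.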


Mining after Summarization

[Figure: original graphs G1, G2, … are summarized into g1, g2, …; mining runs on the summaries and produces the output patterns. Vertex labels shown: a, b, c.]


Remedy for Pattern Changes

Frequent subgraphs are presented on a different abstraction level
    False negatives & false positives, compared to true patterns mined from the un-summarized database D
False negatives (recover)
    Randomized technique + multiple rounds
False positives (delete)
    Verify against D
    Substantial work can be transferred to the summaries


Outline

Motivation
Summarize-Mine
    The algorithm flow-chart
    Recovering false negatives
    Verifying false positives
Experiments
Summary


False Negatives

For a pattern p, if each of its vertices bears a different label, then the embeddings of p must be preserved after summarization
    Since we are merging groups of vertices by label, the nodes of p should stay in different groups
Otherwise, ...

[Figure: original graph Gi, its summary gi, and pattern p, with vertex labels a, b, c]


Missing Prob. of Embeddings

Suppose
    Assign xj nodes for label lj (j=1,…,L) in the summary Si => xj groups of nodes with label lj in the original graph Gi
    Pattern p has mj nodes with label lj
Then


No “Collision” for Same Labels

Consider a specific embedding f: p->Gi; f is preserved if vertices in f(p) stay in different groups
Randomly assign mj nodes with label lj to xj groups, the probability that they will not “collide” is:
Multiply probabilities for independent events
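
The formula on this slide is not preserved in the transcript. Assuming each vertex is assigned to one of the xj groups of its label uniformly and independently, a reconstruction consistent with the definitions above and with the worked example on the next slide (mj = 2, xj = 20 gives 19/20 per label) would be:

```latex
\Pr[\text{no collision for label } l_j]
  = \frac{x_j (x_j - 1) \cdots (x_j - m_j + 1)}{x_j^{m_j}},
\qquad
\Pr[f \text{ is preserved}]
  = \prod_{j=1}^{L} \frac{x_j (x_j - 1) \cdots (x_j - m_j + 1)}{x_j^{m_j}}
```

The first factor counts the ways to place mj vertices into distinct groups over all possible placements; the product follows because assignments for different labels are independent.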


Example

A pattern with 5 labels, each label => 2 vertices
    m1 = m2 = m3 = m4 = m5 = 2
Assign 20 nodes in the summary (i.e., 20 node groups in the original graph) for each label
    The summary has 100 vertices
    x1 = x2 = x3 = x4 = x5 = 20
The probability that an embedding will persist:
    $\frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \approx 0.774$
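
The arithmetic in this example can be checked with a short script. This is a sketch under the assumption that vertices are assigned to groups uniformly and independently; the function names `no_collision_prob` and `persist_prob` are hypothetical.

```python
from math import perm, prod

def no_collision_prob(m, x):
    """Probability that m vertices placed uniformly into x groups all land apart."""
    return perm(x, m) / x ** m  # perm(x, m) is 0 when m > x

def persist_prob(ms, xs):
    """Multiply the per-label probabilities (labels are handled independently)."""
    return prod(no_collision_prob(m, x) for m, x in zip(ms, xs))

q = persist_prob([2] * 5, [20] * 5)
print(round(q, 3))  # 0.774
```

Each label contributes 20·19/20² = 19/20, so the product over five labels is (19/20)⁵.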


Extend to Multiple Graphs

With x1,…,xL set to the same values across all Gi’s in the database, the persistence probability depends only on m1,…,mL, i.e., pattern p’s vertex label distribution
    We denote this probability as q(p)
For each of p’s support graphs in D, it has a probability of at least q(p) to continue supporting p
    Thus, the overall support can be bounded below by a binomial random variable


Support Moves Downward


False Negative Bound


Example, Cont.

As above, q(p)=0.774
min_sup=50

min_sup'    40      39      38      37      36      35
1 round     0.5966  0.4622  0.3346  0.2255  0.1412  0.0820
2 rounds    0.3559  0.2136  0.1119  0.0508  0.0199  0.0067
3 rounds    0.2123  0.0988  0.0374  0.0115  0.0028  0.0006
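
A sketch of how such a table could be computed, under stated assumptions: the pattern's true support is exactly min_sup = 50, each supporting graph keeps the pattern independently with probability exactly q(p) = (19/20)⁵ (the worst case of the bound), and rounds are independent, so a pattern is lost only if every round misses it. The function names are hypothetical.

```python
from math import comb

def miss_prob_one_round(support, q, reduced_min_sup):
    """P[Binomial(support, q) < reduced_min_sup]: the pattern looks infrequent."""
    return sum(
        comb(support, k) * q ** k * (1 - q) ** (support - k)
        for k in range(reduced_min_sup)
    )

def miss_prob(support, q, reduced_min_sup, rounds):
    """A pattern is a false negative only if all independent rounds miss it."""
    return miss_prob_one_round(support, q, reduced_min_sup) ** rounds

q = 0.95 ** 5
for rounds in (1, 2, 3):
    row = [round(miss_prob(50, q, s, rounds), 4) for s in range(40, 34, -1)]
    print(rounds, row)
```

Under these assumptions the one-round values come out close to the first row of the table, and the later rows are their squares and cubes. This is why lowering min_sup' or adding rounds both drive the miss probability down quickly.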


False Positives

Much easier to handle
    Just check against the original database D
    Discard if this “actual” support is less than min_sup

[Figure: original graph Gi, its summary gi, and pattern p, with vertex labels a, b, c]


The Same Skeleton as gSpan

DFS code tree
    Depth-first search
    Minimum DFS code?
    Check support by isomorphism tests
    Record all one-edge extensions along the way
    Pass down the projected database and recurse


Integrate Verification Schemes

Top-Down and Bottom-Up
Possible factors
    Amount of false positives
    Top-down verification can be performed early
Top-down preferred by experiments

Transaction ID list for p1 => Dp1
    Just search within Dp1
Transaction ID list for p2 => Dp2
    Just search within D-Dp2; if frequent, can stop
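
One possible reading of the two callouts, sketched below: p1 is an already verified sub-pattern of the pattern being checked, so only graphs in Dp1 can contain it (top-down); p2 is an already verified super-pattern, so every graph in Dp2 already supports it and only D-Dp2 needs testing (bottom-up). This interpretation, the function names, and the caller-supplied `contains(graph, pattern)` subgraph test are assumptions, not taken from the slides.

```python
def verify_top_down(pattern, parent_tids, database, contains, min_sup):
    """Search only within the transaction ID list of a verified sub-pattern."""
    tids = {t for t in parent_tids if contains(database[t], pattern)}
    return len(tids) >= min_sup, tids

def verify_bottom_up(pattern, child_tids, database, contains, min_sup):
    """Start from a verified super-pattern's ID list and test only the rest."""
    tids = set(child_tids)
    for t in database:
        if len(tids) >= min_sup:
            break  # already frequent: stop early (tids is then a lower bound)
        if t not in tids and contains(database[t], pattern):
            tids.add(t)
    return len(tids) >= min_sup, tids

# Stand-in data: item sets play the role of graphs, subset test plays `contains`.
database = {0: {"a", "b", "c"}, 1: {"a", "b"}, 2: {"a"}, 3: {"b", "c"}}
contains = lambda graph, pattern: pattern <= graph

print(verify_top_down({"a", "b"}, {0, 1, 2}, database, contains, 2))
print(verify_bottom_up({"a"}, {0, 1}, database, contains, 3))
```

In both directions the expensive containment test runs on fewer graphs than a scan of the whole database D, which is the point of keeping transaction ID lists in the pattern tree.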


Summary-Guided Verification

Substantial verification work can be performed on the summaries, as well


Iterative Summarize-Mine

Use a single pattern tree to hold all results spanning across multiple iterations
    No need to combine pattern sets in a final step
    Avoid verifying patterns that have already been checked by previous iterations
    Verified support graphs are accurate, they can help pre-pruning in later iterations
Details omitted


Outline

Motivation
Summarize-Mine
Experiments
Summary


Dataset

Real data
    W32.Stration, a family of mass-mailing worms
    W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc.
    Vertex # up to 20,000 and edge # even higher
    Avg. # of vertices: 1,300
Synthetic data
    Size, # of distinct node/edge labels, etc.
    Generator details omitted


A Sample Malware Signature

Mined from W32.Stration
A malware reading and leaking certain registry settings related to the network devices


Comparison with gSpan

gSpan is an efficient graph pattern mining algorithm
Graphs of different sizes are randomly drawn
Eventually, gSpan cannot work


The Influence of min_sup'

Total vs. False Positives
The gap corresponds to true patterns
It gradually widens as we decrease min_sup'


Summarization Ratio

10/1 node(s) before/after summarization => ratio=10
Trading-off min_sup' and t as the inner loop
A range of reasonable parameters in the middle


Scalability

On the synthetic data
Parameters are tuned as done above


Outline

Motivation
Summarize-Mine
Experiments
Summary


Summary

We solve the frequent subgraph mining problem for graphs with big size
    We found interesting malware signatures
    Our algorithm is much more efficient, while the state-of-the-art mining technologies do not work
We show that patterns can be well preserved on a higher level by a good generalization scheme
    Very useful, given the emerging trend of huge networks
    The data has to be preprocessed and summarized


Summary

Our method is orthogonal to many previous works on this topic => Combine for further improvement
    Efficient pattern space traversal
    Other data space reduction techniques different from our compression within individual transactions
        Transaction sampling, merging, etc.
        They perform compression between transactions
