Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning

Dec 20, 2015

Page 1: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Dr. Larry Holder, School of EECS, WSU

Graph-based Pattern Learning

Page 2: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Graphs

Protein-protein Interaction

Power Grid

Social Network

Internet

Web

Page 3: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Some Graph Statistics

• Web
    - 10B pages, 1T hyperlinks
    - Topology storage: 10TB
    - Google PageRank: eigenvector on 10B x 10B adjacency matrix (sparse)

• MySpace
    - 100M users, 10B friendship links
    - Clique/community detection
    - 300K new users per day
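The PageRank entry above amounts to finding the dominant eigenvector of a matrix derived from the sparse adjacency structure. A minimal sketch using power iteration over adjacency lists follows; the damping factor of 0.85, the iteration count, and the four-page toy graph are assumptions for illustration, not values from the slides.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration for PageRank on a sparse adjacency-list graph.

    out_links maps every page to the list of pages it links to; each
    target must also appear as a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by dangling pages (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n
                    for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for w in targets:
                    new_rank[w] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: d links to c but nothing links to d.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

Only the non-zero links are stored and visited, which is what makes the computation feasible when the matrix is sparse.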

Page 4: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Graph Problems

• Degree

• Diameter

• Centrality

• Shortest path

• Cycles/tours

• Minimum spanning tree

• Traversals/search

• Connectivity

• Clustering

• Partitioning

• Cliques

• Motifs

• Subgraph isomorphism

• Frequent subgraphs

• Pattern learning

• Dynamics

Page 5: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Graph-based Pattern Learning

• Unsupervised pattern discovery

• Hierarchical conceptual clustering

• Supervised pattern learning

• Anomaly detection

• Dynamic graph pattern learning

Page 6: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Unsupervised Pattern Discovery

• Frequency-based (AGM, gSpan, FSG, Gaston)
    - “Graph-based Data Mining”
    - Find all subgraphs g within a set of graph transactions G such that

      $\frac{|\{g' \in G \mid g \subseteq g'\}|}{|G|} \geq t$

      where $\subseteq$ is subgraph isomorphism and t is the minimum support
    - Focus on pruning and fast, code-based graph matching
    - Still requires subgraph isomorphism
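The support condition above can be checked directly on toy data. The sketch below is not how AGM, gSpan, FSG, or Gaston work internally: it brute-forces an edge-preserving injective vertex mapping (non-induced subgraph isomorphism), which is exponential and only suitable for tiny graphs. The graph encoding, the helper names, and the sample transactions are all assumptions.

```python
from itertools import permutations


def contains_subgraph(pattern, graph):
    """True if `pattern` maps into `graph` with labels and edges preserved.

    A graph is (labels, edges): labels maps vertex -> label, and edges is
    a set of frozenset({u, v}) pairs (undirected, unlabeled, no loops).
    """
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_nodes):
            continue
        if all(frozenset(mapping[v] for v in e) in g_edges for e in p_edges):
            return True
    return False


def support(pattern, transactions):
    """Fraction of graph transactions that contain the pattern."""
    hits = sum(1 for g in transactions if contains_subgraph(pattern, g))
    return hits / len(transactions)


def edge(u, v):
    return frozenset((u, v))


# Hypothetical transactions: small labeled graphs.
transactions = [
    ({1: "C", 2: "O", 3: "H"}, {edge(1, 2), edge(1, 3)}),
    ({1: "C", 2: "O"}, {edge(1, 2)}),
    ({1: "C", 2: "N"}, {edge(1, 2)}),
    ({1: "C", 2: "O"}, set()),
]
c_o = ({"x": "C", "y": "O"}, {edge("x", "y")})  # pattern: a C-O edge
t = 0.5                                          # minimum support
is_frequent = support(c_o, transactions) >= t
```

The pattern occurs in two of the four transactions, so it meets a minimum support of 0.5 exactly.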

Page 7: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Unsupervised Pattern Discovery

• Graph compression and the minimum description length (MDL) principle
    - The best theory minimizes the description length of the theory and the description length of the data given the theory

• The best graphical pattern S minimizes the description length of S and the description length of the graph G compressed with pattern S:

  $\min_{S} \left( DL(S) + DL(G \mid S) \right)$

• where description length DL(G) is the minimum number of bits needed to represent G (SUBDUE)

• Compression can be based on inexact matches to pattern

[Figure: example graph with instances of patterns S1 and S2 marked]
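To make the MDL criterion concrete, the sketch below scores two candidate patterns with a deliberately crude bit count. The encoding (one label per vertex plus two endpoints per edge) is an assumption and is much simpler than the encoding SUBDUE actually uses; the graph sizes and instance counts are hypothetical.

```python
from math import log2


def description_length(n_vertices, n_edges, n_labels):
    """Crude bit count: one label per vertex plus two endpoints per edge."""
    if n_vertices == 0:
        return 0.0
    label_bits = n_vertices * log2(max(n_labels, 2))
    edge_bits = n_edges * 2 * log2(max(n_vertices, 2))
    return label_bits + edge_bits


def mdl_score(graph, pattern, instances):
    """DL(S) + DL(G|S), collapsing each non-overlapping instance of S
    into a single vertex that carries one new label."""
    g_vertices, g_edges, n_labels = graph
    s_vertices, s_edges = pattern
    dl_s = description_length(s_vertices, s_edges, n_labels)
    compressed_vertices = g_vertices - instances * (s_vertices - 1)
    compressed_edges = g_edges - instances * s_edges
    dl_g_given_s = description_length(
        compressed_vertices, compressed_edges, n_labels + 1)
    return dl_s + dl_g_given_s


# Hypothetical graph: 20 vertices, 30 edges, 4 distinct labels.
graph = (20, 30, 4)
# Candidate patterns as ((vertices, edges), number of instances).
candidates = {"S1": ((3, 3), 4), "S2": ((2, 1), 5)}
scores = {name: mdl_score(graph, pat, n)
          for name, (pat, n) in candidates.items()}
best = min(scores, key=scores.get)
```

Under these assumptions the larger pattern S1 wins even though it has fewer instances, because each instance removes more vertices and edges from the compressed graph.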

Page 8: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Hierarchical Conceptual Clustering

• Use iterative process on input graph G
    Repeat
        • Find best pattern S in graph G
        • Add S to hierarchy
        • G = G compressed with S
    Until no more compression

• Clustering is a lattice

• Clusters described by pattern
    - Not just instances as in traditional clustering techniques
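The repeat/until loop above can be sketched on a toy graph. This is a simplified stand-in, not SUBDUE: patterns are restricted to single edges between labeled vertices, and instance count replaces the MDL score when picking the best pattern. All function names and the sample graph are assumptions.

```python
from collections import defaultdict


def find_best_pattern(labels, edges):
    """Return (label_pair, instances) for the most frequent one-edge
    pattern, counting only non-overlapping instances."""
    found = defaultdict(list)
    used = defaultdict(set)
    for u, v in sorted(edges):
        pair = tuple(sorted((labels[u], labels[v])))
        if u not in used[pair] and v not in used[pair]:
            found[pair].append((u, v))
            used[pair].update((u, v))
    if not found:
        return None, []
    best = max(sorted(found), key=lambda p: len(found[p]))
    return best, found[best]


def compress(labels, edges, instances, new_label):
    """Collapse each instance (u, v) into vertex u relabeled new_label."""
    merged = {v: u for u, v in instances}
    new_labels = {x: l for x, l in labels.items() if x not in merged}
    for u, _ in instances:
        new_labels[u] = new_label
    new_edges = set()
    for a, b in edges:
        a, b = merged.get(a, a), merged.get(b, b)
        if a != b:
            new_edges.add((min(a, b), max(a, b)))
    return new_labels, new_edges


def cluster(labels, edges):
    """Repeat: find best pattern, add to hierarchy, compress the graph."""
    hierarchy = []
    while True:
        pattern, instances = find_best_pattern(labels, edges)
        if len(instances) < 2:   # no repeated structure left to compress
            break
        name = f"S{len(hierarchy) + 1}"
        hierarchy.append((name, pattern))
        labels, edges = compress(labels, edges, instances, name)
    return hierarchy


# Hypothetical input graph: three A-B edges, two of them attached to a C.
labels = {1: "A", 2: "B", 3: "C", 4: "A", 5: "B", 6: "C", 7: "A", 8: "B"}
edges = {(1, 2), (2, 3), (4, 5), (5, 6), (7, 8)}
hierarchy = cluster(labels, edges)
```

The second pattern found is defined in terms of the first (an S1 vertex joined to a C vertex), which is how later clusters come to sit below earlier ones in the hierarchy.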

Page 9: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

DHS Insight Project: Terrorist Group Data

[Figure: processing pipeline with the labels Mock Terrorist Scenario; Event Generator (fund raising, recruitment, training, reconnaissance, ...); Observables; Message Traffic Reports (142); SRA TEES Text Extraction System; Entities and Relationships; Convert to Graph; SUBDUE Pattern Learner; Patterns]

[Figure: hierarchical pattern discovered at 7th iteration of SUBDUE, with graph labels organization, male, place, location, and affiliation]

Page 10: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

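Supervised Learning

• Given positive graph G+ and negative graph G-

• Find pattern S minimizing DL(G+ | S) / DL(G- | S)

• When |G+|,|G-| >> 1, find pattern S maximizing classification accuracy:

  $\frac{|\{g \in G^{+} \mid S \subseteq g\}| + |\{g \in G^{-} \mid S \not\subseteq g\}|}{|G^{+}| + |G^{-}|} = \frac{TP + TN}{P + N}$

[Figure: Positive Graphs and Negative Graphs are input to SUBDUE, which outputs Pattern(s)]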
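The accuracy measure above counts positive graphs that contain S (true positives) and negative graphs that do not (true negatives). A minimal sketch follows; to stay short it represents each graph as a set of (source label, edge label, target label) triples and treats containment as a subset test, which is an assumption standing in for real subgraph isomorphism. The example graphs are hypothetical.

```python
def accuracy(pattern, positives, negatives, contains):
    """(TP + TN) / (P + N) for a candidate pattern."""
    tp = sum(1 for g in positives if contains(pattern, g))
    tn = sum(1 for g in negatives if not contains(pattern, g))
    return (tp + tn) / (len(positives) + len(negatives))


def contains_edges(pattern, graph):
    """Simplified containment: every pattern triple occurs in the graph."""
    return pattern <= graph


positives = [
    {("person", "transfers", "resource"), ("person", "contacts", "person")},
    {("person", "transfers", "resource"), ("person", "visits", "target")},
    {("person", "visits", "target")},
]
negatives = [
    {("person", "contacts", "person")},
    {("person", "transfers", "resource")},
]
candidates = [
    frozenset({("person", "transfers", "resource")}),
    frozenset({("person", "visits", "target")}),
]
best = max(candidates,
           key=lambda s: accuracy(s, positives, negatives, contains_edges))
```

Here the second candidate scores 4/5 against 3/5 for the first, because the first also matches one of the negative graphs.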

Page 11: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Evidence Assessment, Grouping, Linking and Evaluation (EAGLE) Program (DARPA/AFRL)

Evidence DB (EDB) contains simulated data on threat and non-threat activity
• Persons, targets, capabilities, resources, transfers, and communications

[Figure: workflow with the labels EDB; Convert EDB to SUBDUE graph format; Positive & negative examples (Threat, Non-threat); SUBDUE; Patterns; Evaluate]

Results   Examples   Entities   Relations   Accuracy   Time
Events    308        533,196    630,733     80%        86 min
Groups    84         457,209    597,163     85%        813 min

Page 12: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Graph Regression (with Nikhil Ketkar, WSU)

• Learn a model Yi = f(Gi), where Yi is a real number and Gi is a graph
    - E.g., solubility or binding activity of chemical compounds

• One approach
    - Apply frequent-graph miner to set of training graphs Gi
    - Frequent subgraphs form a feature vector V
    - Input {(Yi, Vi)} to linear support-vector machine

• gRegress approach
    - Prune feature set based on correlation with other features and lack of correlation with Y
    - Learn model using non-linear SVM or piece-wise regression
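The pruning step described for gRegress can be illustrated with Pearson correlation on binary subgraph-presence features. This is a sketch of the general idea only: the two thresholds, the greedy selection order, and the sample data are assumptions, not the published gRegress procedure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def prune_features(columns, y, min_target_corr=0.5, max_feature_corr=0.9):
    """Keep features that correlate with y and are not redundant.

    columns maps a subgraph name to 0/1 indicators, one per training graph.
    """
    ranked = sorted(columns, key=lambda f: -abs(pearson(columns[f], y)))
    kept = []
    for f in ranked:
        if abs(pearson(columns[f], y)) < min_target_corr:
            continue  # lacks correlation with the target Y
        if any(abs(pearson(columns[f], columns[k])) > max_feature_corr
               for k in kept):
            continue  # redundant with a feature already kept
        kept.append(f)
    return kept


# Hypothetical data: six training graphs with a real-valued target.
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
columns = {
    "ring": [0, 0, 0, 1, 1, 1],
    "ring_copy": [0, 0, 0, 1, 1, 1],   # duplicates "ring"
    "chain": [1, 0, 1, 0, 1, 0],       # weakly related to y
}
kept = prune_features(columns, y)
```

The surviving columns would then form the feature vectors Vi passed to the regression learner.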

Page 13: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Anomaly Detection (with Bill Eberle, TTU)

• Learn normative patterns of activity

• Detect small, unlikely deviations from normative patterns

• Present anomalies and their context to analyst

[Figure: Graph-Based Anomaly Detection (GBAD) workflow with the labels Activity Data; Convert to graph; SUBDUE; Normative Pattern; GBAD; Anomaly]

Page 14: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

GBAD Approach

• Determine normative pattern S using SUBDUE minimum description length (MDL) heuristic that minimizes: M(S,G) = DL(G|S) + DL(S)

• Three algorithms for handling each of the different anomaly categories
    - GBAD-MDL finds anomalous modifications
    - GBAD-P (Probability) finds anomalous insertions
    - GBAD-MPS (Maximum Partial Substructure) finds anomalous deletions

Page 15: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

DHS Insight Project: Cargo Data

• Shipment data from PIERS (Port Import Export Reporting Service)
• Only North American imports (U.S., Puerto Rico, Canada)
• 65,535 records (shipments)
• Information categories:
    - General
    - Commodity codes
    - Countries and ports
    - U.S. company names and locations
    - Foreign shipper names and locations
    - Notification party names and locations
    - Shipping line, vessel and packaging
    - Container
    - Weight and shipment
    - Financial

[Figure: graph representation of one shipment record. A SHIPMENT vertex is connected by HAS_A edges to category vertices (ARRIVAL_INFO, COMMODITY, COUNTRIES_AND_PORTS, US_IMPORTER, FOREIGN_SHIPPER, VESSEL, TARIFF, CONTAINER, FINANCIAL, CARGO). Attribute edges include VDATE, FPORT, USPORT, COUNTRY, NAME, FNAME, SLINE, VOYAGE, HARM_DESC, HSCODE, VALUE, BOL_NBR, HAZMAT_FLA, CONSIZE, TEUS, and MTONS, with values such as “020601”, “EMPTY RACK”, “YOKOHAMA”, “SEATTLE”, “JAPAN”, “AMERICAN TRI NET EXPRESS”, “TRI NET”, “CSCO”, “LING YUN HE”, 860900, and “TOLU4972933”.]

Page 16: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Anomaly Detection in Cargo Data

• Marijuana seized at port in Florida [U.S. Customs Service 2000].
• Smuggler did not disclose some financial information, and ship traversed extra port.
• GBAD-P discovers the extra traversed port; GBAD-MPS discovers the missing financial information.

Page 17: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

DHS CyberSecurity R&D Program: Insider Threat Detection using Graphs

Gov’t ID Request Processing

Insider Threat Scenarios (CERT Insider Threat Documents)
1. Frontline staff reviews case (invasion of privacy).
2. Frontline staff submits case directly to a case officer (bypassing the approval officer).
3. Frontline staff recommends or decides case.
4. Approval officer reverses accept/reject recommendation from assigned case officer.
5. Unassigned case officer updates or recommends case.
6. Applicant communicates with approval officer or case officer.
7. Unassigned case officer communicates with applicant.
8. Database access from an external source or after hours.

[Figures: GBAD on Scenario 1; GBAD on Scenario 4]

• 1000 cases
• Multiple normative patterns
• 1-3 anomalies
• No false positives

Page 18: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Dynamic Graph Pattern Learning (with Chang hun You, WSU)

• Dynamic graph DG = {G1, G2, …, Gn}

• Find graph rewrite rules between pairs of graphs Gi / Gi+1
    - Find common subgraph between Gi and Gi+1
    - Remainder of Gi to be removed (GR)
    - Remainder of Gi+1 to be added (GA)

• Find transformation rules of temporal patterns in rewrite rules
    - Remove (GR) at time t, then add (GA) at time t+k
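The rewrite-rule construction above can be sketched with set operations. The sketch assumes vertices keep stable names across time steps, so the common subgraph of Gi and Gi+1 reduces to an intersection of edge sets; without that assumption a maximum common subgraph search would be needed. Edge triples, function names, and the sample sequence are hypothetical.

```python
def rewrite_rule(g_prev, g_next):
    """Return (GR, GA): edges to remove from Gi and edges to add for Gi+1.

    Each graph is a set of (source, label, target) triples.
    """
    common = g_prev & g_next
    return g_prev - common, g_next - common


def rewrite_rules(dynamic_graph):
    """One (GR, GA) rule per consecutive pair Gi / Gi+1."""
    return [rewrite_rule(a, b)
            for a, b in zip(dynamic_graph, dynamic_graph[1:])]


def removal_to_addition_lags(rules):
    """For each edge removed at step t, the lag k until it is next added."""
    lags = {}
    for t, (removed, _) in enumerate(rules):
        for e in removed:
            for k in range(1, len(rules) - t):
                if e in rules[t + k][1]:
                    lags.setdefault(e, []).append(k)
                    break
    return lags


# Hypothetical dynamic graph: e2 disappears and reappears periodically.
e1 = ("geneA", "activates", "geneB")
e2 = ("geneB", "inhibits", "geneC")
dg = [{e1, e2}, {e1}, {e1}, {e1, e2}, {e1}, {e1}, {e1, e2}]
rules = rewrite_rules(dg)
lags = removal_to_addition_lags(rules)
```

A constant lag for the same edge, as in this toy sequence, is the kind of regularity a temporal transformation rule ("remove at time t, then add at time t+k") would capture.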

Page 19: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Dynamic Graph (BioNet)

Page 20: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Graph Rewriting Rule

Page 21: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Example: Circadian Rhythm in Drosophila (Fruit Fly)

Page 22: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Example: Circadian Rhythm in Drosophila (Fruit Fly)

Transformation rule (Sub 1): Structure appearing and disappearing in network.

Full temporal transformation rule: Boxes are removals (after 5 hours), and ellipses are additions (after 7 hours) of Sub 1. Cycles every 12 hours. Time 6-47 is training; time 54-66 is prediction.

Page 23: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Graph-based Pattern Learning

• Algorithms
    - Pattern discovery and clustering
    - Supervised learning
    - Anomaly detection
    - Dynamic graphs

• Applications
    - Social networks
    - Biological networks
    - Computer networks
    - Process flows
    - (Semantic) Web
    - …

linkeddata.org

Page 24: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

High Performance Computing Issues

• Memory bottleneck
    - Most real-world graphs do not fit in main memory
    - Patterns of access to graph not sequential

• Computational bottleneck
    - Graph and subgraph isomorphism

Page 25: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

High Performance Computing Issues

• Functional parallelism
    - Parallel search over space of candidate subgraph patterns
        • High communication to avoid redundancy
        • Child patterns rely on embeddings kept with parent
            - Hinders parallelism
            - Computing embeddings from scratch is NP-complete (NPC)

• Data parallelism
    - Partition graphs, find patterns in each partition, evaluate patterns in other partitions
        • Edge cuts may break patterns
        • May require NPC subgraph isomorphism

Page 26: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Data-Intensive Scalable Computing

• MapReduce [Google]
    - Dean & Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004.

• Hadoop [Yahoo]
    - MapReduce
    - Distributed filesystem

[Figure: Map and Reduce stages]
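As an illustration of the map and reduce stages applied to graph data, the sketch below computes vertex degree from an edge list. It runs in a single process and does not use the Hadoop API; the driver function `map_reduce` and the sample edges are assumptions that only mimic the map, shuffle, and reduce steps.

```python
from collections import defaultdict


def map_edge(edge):
    """Map: emit (vertex, 1) for both endpoints of an edge."""
    u, v = edge
    yield u, 1
    yield v, 1


def reduce_degree(vertex, counts):
    """Reduce: sum the counts emitted for one vertex."""
    return vertex, sum(counts)


def map_reduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)   # shuffle: group values by key
    return dict(reducer(k, vs) for k, vs in groups.items())


# Hypothetical edge list.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
degrees = map_reduce(edges, map_edge, reduce_degree)
```

Degree suits this model because each edge can be processed independently; problems that need subgraph isomorphism do not decompose this cleanly.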

Page 27: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Multiscale Issues

• Hierarchical networks
    - Higher-level hyper-nodes summarize detail at lower levels
    - E.g., Netflix prize (www.netflixprize.com)
        • 17K movies, 400K users, 100M reviews
        • E.g., user’s average rating vs. specific ratings
        • E.g., movie’s average rating vs. specific rating

[Figure: user and movie vertices joined through a review vertex, with the labels rating 5, avg. rating 3.5, title “Matrix”, avg. rating 4.5, and (reviews…); a higher-level user and movie pair summarizes the detail]
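The hyper-node idea can be sketched by collapsing individual review edges into one average per user and per movie. The review triples, the function name `summarize`, and the second movie title are hypothetical; they are chosen so that the averages come out as 3.5 for the user and 4.5 for “Matrix”.

```python
from collections import defaultdict


def summarize(reviews):
    """Build hyper-node attributes: average rating per user and per movie.

    reviews is a list of (user, movie, rating) triples, i.e. the
    lower-level review vertices of the graph.
    """
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for user, movie, rating in reviews:
        by_user[user].append(rating)
        by_movie[movie].append(rating)
    user_avg = {u: sum(r) / len(r) for u, r in by_user.items()}
    movie_avg = {m: sum(r) / len(r) for m, r in by_movie.items()}
    return user_avg, movie_avg


# Hypothetical reviews.
reviews = [("u1", "Matrix", 5), ("u1", "Movie B", 2), ("u2", "Matrix", 4)]
user_avg, movie_avg = summarize(reviews)
```

Pattern learning can then run on the much smaller summary graph and descend to specific ratings only where the summary looks interesting.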

Page 28: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

Conclusions

• Graph representation of relational data

• Graph-based pattern learning improves understanding of modeled behavior

• Massive, dynamic graphs

• Numerous application domains

• Graph problems computationally and memory intensive

• HPC (data-intensive computing) and multiscale approaches

Page 29: Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.

For More Information

• Larry Holder, School of EECS, WSU
    - Email: [email protected]
    - URL: www.eecs.wsu.edu/~holder

• SUBDUE
    - Source code in C
    - Datasets
    - www.subdue.org

• D. Cook and L. Holder (2006). Mining Graph Data, Wiley. (www.eecs.wsu.edu/mgd)