Thesis Committee
• Prof. Christos Faloutsos (Chair)
• Prof. Tom M. Mitchell
• Prof. Leman Akoglu
• Prof. Philip S. Yu
Mining Large Dynamic Graphs and Tensors
Graphs: Social Networks
Graphs: Purchase History
Graphs: Many More
Properties of Real-world Graphs
• Large: many nodes, even more edges
◦ e.g., 2B+ active users, 500M+ products, 40B+ web pages, 5M+ articles
• Dynamic: additions/deletions of nodes and edges
Properties of Real-world Graphs
• Rich with Attributes: timestamps, scores, text, etc.
Matrices for Graphs
[Figure: a graph and its adjacency matrix of 0/1 entries]
Tensors for Rich Graphs
• Tensors: multi-dimensional arrays
◦ a 3-order tensor is a 3-dimensional array
◦ + stars → a 4-order tensor
◦ + text → a 5-order tensor
◦ …
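For concreteness, a minimal sketch of how a rich graph can be stored as a sparse 3-order tensor; the key layout (source, destination, day) and the sample edges are assumptions for illustration.

```python
from collections import defaultdict

# Sparse 3-order tensor: (source, destination, day) -> number of interactions.
# Appending more attributes (e.g., a star rating, a keyword) to the key
# would give a 4- or 5-order tensor in the same way.
tensor = defaultdict(int)

edges = [  # hypothetical (source, destination, day) records
    ("alice", "bob", 1),
    ("alice", "bob", 1),
    ("carol", "bob", 2),
]
for src, dst, day in edges:
    tensor[(src, dst, day)] += 1

print(tensor[("alice", "bob", 1)])  # 2
```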
Research Goal and Tasks
• Goal: to understand large dynamic graphs and tensors on user behavior
• Tasks:
◦ T1. Structure Analysis
◦ T2. Anomaly Detection
◦ T3. Behavior Modeling
Tasks
[Figure: the three tasks: Structure, Anomaly & Fraud, Behavior Model (Contrast)]
Completed Work by Topics
• Graphs:
◦ T1. Structure Analysis: Triangle Count [ICDM17] [PAKDD18] [submitted to KDD]; Degeneracy [ICDM16]* [KAIS18]*
◦ T2. Anomaly Detection: Anomalous Subgraph [ICDM16]* [KAIS18]*
◦ T3. Behavior Modeling: Purchase Behavior [IJCAI17]
• Tensors:
◦ T1. Structure Analysis: Summarization [WSDM17]
◦ T2. Anomaly Detection: Dense Subtensors [PKDD16] [WSDM17] [KDD17] [TKDD18]
◦ T3. Behavior Modeling: Progressive Behavior [WWW18]
* Duplicated: the same papers appear under two topics
Approaches (Tools)
• A1. Distributed or external-memory algorithms
• A2. Streaming algorithms based on sampling
• A3. Approximation algorithms
• … and their combinations
Roadmap
• Overview
• Completed Work <<
◦ T1. Structure Analysis
◦ T2. Anomaly Detection
◦ T3. Behavior Modeling
• Proposed Work
• Conclusion
Roadmap
• Overview
• Completed Work
◦ T1. Structure Analysis
▪ T1.1 Waiting-Room Sampling <<
▪ T1.2-T1.3 Related Completed Work
◦ T2. Anomaly Detection
◦ T3. Behavior Modeling
• Proposed Work
• Conclusion

Kijung Shin, “WRS: Waiting Room Sampling for Accurate Triangle Counting in Real Graph Streams”, ICDM 2017
Graph Stream Model
• Widely-used data model for graphs
• Sequence of edges
◦ the graph is given over time as a sequence of edges
◦ appropriate for dynamic graphs
• Limited memory
◦ we cannot store all edges in the stream, only samples or summaries
◦ appropriate for large graphs
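As an illustration of keeping “only samples” under a fixed memory budget, here is a minimal sketch of standard reservoir sampling over an edge stream; the budget k and the synthetic stream are assumptions for the example.

```python
import random

def reservoir_sample(edge_stream, k):
    """Keep a uniform random sample of up to k edges from a stream."""
    sample = []
    for i, edge in enumerate(edge_stream):
        if len(sample) < k:
            sample.append(edge)           # fill the budget first
        else:
            j = random.randrange(i + 1)   # each edge is kept with prob k/(i+1)
            if j < k:
                sample[j] = edge
    return sample

print(reservoir_sample(((u, u + 1) for u in range(1000)), k=10))
```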
Relaxed Graph Stream Model
• Chronological order
◦ edges are streamed in the order in which they are created (e.g., 9:02 AM, 9:08 AM, 9:21 AM)
◦ natural for dynamic graphs
◦ temporal patterns can exist
◦ algorithms can exploit the patterns
Triangles in a Graph
• A triangle is a set of 3 nodes connected to each other
• The count of triangles has many applications
◦ community detection, spam detection, query optimization
• Global triangle count: the count of all triangles in the graph
• Local triangle count: the count of the triangles incident to each node
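For reference, a minimal sketch of exact global and local triangle counting on a small static graph (the example edges are hypothetical); the streaming algorithms below approximate these quantities without storing the full graph.

```python
from collections import defaultdict

edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]  # hypothetical graph

neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

global_count = 0
local_count = defaultdict(int)
for u, v in edges:
    # w closes a triangle with edge (u, v); each triangle is found once
    # from each of its 3 edges, so the global total is divided by 3,
    # while each discovery credits the "third" node w exactly once.
    for w in neighbors[u] & neighbors[v]:
        global_count += 1
        local_count[w] += 1
global_count //= 3

print(global_count)        # 2 triangles: (1,2,3) and (2,3,4)
print(dict(local_count))   # nodes 2 and 3 are in both triangles
```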
Problem Definition
• Given:
◦ a sequence of edges in chronological order
◦ a memory budget 𝑘 (i.e., up to 𝑘 edges can be stored)
• Estimate: the count of global triangles
• To Minimize: estimation error

“What are the temporal patterns in real graph streams?”
“How can we exploit the patterns for accurate triangle counting?”
Roadmap
• Overview
• Completed Work
◦ T1. Structure Analysis
▪ T1.1 Waiting-Room Sampling
◦ Temporal Pattern <<
◦ Algorithm
◦ Experiments
▪ T1.2-T1.3 Related Completed Work
◦ T2. Anomaly Detection
◦ T3. Behavior Modeling
• Proposed Work
• Conclusion
Time Interval of a Triangle
• Time interval of a triangle: (arrival order of its last edge) − (arrival order of its first edge)
• Example: if a triangle’s first edge is the 2nd arrival and its last edge is the 7th, its time interval is 7 − 2 = 5
Time Interval Distribution
• Temporal Locality: the average time interval is 2X shorter in the chronological order than in a random order
[Figure: time-interval distributions under the chronological and random arrival orders]
Temporal Locality
• One interpretation: edges are more likely to form triangles with edges close in time than with edges far in time
• Another interpretation: new edges are more likely to form triangles with recent edges than with old edges

“How can we exploit temporal locality for accurate triangle counting?”
Roadmap
• Overview
• Completed Work
◦ T1. Structure Analysis
▪ T1.1 Waiting-Room Sampling
◦ Temporal Pattern
◦ Algorithm <<
◦ Experiments
▪ T1.2-T1.3 Related Completed Work
◦ T2. Anomaly Detection
◦ T3. Behavior Modeling
• Proposed Work
• Conclusion
Algorithm Overview
• ∆: estimate of the triangle count
• 𝑝𝑢𝑣𝑤: probability that triangle (𝑢, 𝑣, 𝑤) is discovered
• For each new edge 𝑢 − 𝑣, the algorithm performs three steps: (1) Arrival Step, (2) Counting Step, and (3) Sampling Step
Algorithm Overview (cont.)
• (1) Arrival Step: a new edge 𝑢 − 𝑣 arrives while the sampled edges (here 𝑢|𝑥, 𝑢|𝑦, 𝑣|𝑥, 𝑣|𝑦) are kept in memory
Algorithm Overview (cont.)
• (2) Counting Step: every triangle formed by the new edge and two sampled edges is discovered, and for each discovered triangle the estimate is updated
◦ discover (𝑢, 𝑣, 𝑥): ∆ ← ∆ + 1/𝑝𝑢𝑣𝑥
◦ discover (𝑢, 𝑣, 𝑦): ∆ ← ∆ + 1/𝑝𝑢𝑣𝑦
Algorithm Overview (cont.)
• (3) Sampling Step: the new edge 𝑢 − 𝑣 may be stored in memory, possibly replacing a sampled edge (here 𝑢|𝑦 is replaced by 𝑢|𝑣)
Goal of Sampling Step
• Goal: to maximize the discovering probability 𝑝𝑢𝑣𝑤

Theorem [Unbiasedness of our estimate]: Bias[∆] = E[∆] − (true count) = 0
Theorem [Variance of our estimate]: Var[∆] ≈ ∑ over triangles (𝑢,𝑣,𝑤) of (1/𝑝𝑢𝑣𝑤 − 1)

• Since Estimation Error = Bias + Variance and the bias is 0, increasing 𝑝𝑢𝑣𝑤 decreases the variance and thus the estimation error
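The unbiasedness follows from a standard inverse-probability (Horvitz-Thompson) argument; a one-line sketch, writing $T$ for the set of all triangles and $I_{uvw}$ for the indicator that triangle $(u,v,w)$ is discovered:

```latex
\mathbb{E}[\Delta]
  = \mathbb{E}\Big[\sum_{(u,v,w) \in T} \frac{I_{uvw}}{p_{uvw}}\Big]
  = \sum_{(u,v,w) \in T} \frac{\mathbb{E}[I_{uvw}]}{p_{uvw}}
  = \sum_{(u,v,w) \in T} \frac{p_{uvw}}{p_{uvw}}
  = |T|.
```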
Increasing the Discovering Probability
“How can we increase the discovering probabilities of triangles?”
• Recall Temporal Locality: new edges are more likely to form triangles with recent edges than with old edges
• Waiting-Room Sampling (WRS) treats recent edges better than old edges to exploit temporal locality
Waiting-Room Sampling (WRS)
• Divides the memory space into two parts:
◦ Waiting Room (FIFO; 𝛼% of the budget): the latest edges are always stored
◦ Reservoir (random replacement; (100 − 𝛼)% of the budget): the remaining edges are sampled
WRS: Sampling Steps (Step 1)
• The new edge (e.g., 𝑒80) always enters the Waiting Room (FIFO), popping the oldest edge in it (e.g., 𝑒76)
WRS: Sampling Steps (Step 2)
• The popped edge (e.g., 𝑒76) either replaces a random edge in the Reservoir (store) or is discarded
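A minimal sketch of the two-part memory, assuming a FIFO waiting room and uniform random replacement in the reservoir; the replacement probability below is a plain reservoir-sampling simplification, not the exact probability WRS uses to keep its estimates unbiased, and alpha is given as a fraction rather than a percentage.

```python
import random
from collections import deque

class WaitingRoomSample:
    """Waiting room (FIFO, always keeps the latest edges) + reservoir."""

    def __init__(self, budget, alpha):
        self.room = deque(maxlen=int(budget * alpha))  # latest edges
        self.reservoir = []                            # sampled older edges
        self.reservoir_cap = budget - self.room.maxlen
        self.n_popped = 0  # edges that have ever left the waiting room

    def add(self, edge):
        popped = self.room[0] if self.room and len(self.room) == self.room.maxlen else None
        self.room.append(edge)          # the new edge always enters the room
        if popped is None:
            return
        self.n_popped += 1
        if len(self.reservoir) < self.reservoir_cap:
            self.reservoir.append(popped)                       # store
        elif random.random() < self.reservoir_cap / self.n_popped:
            self.reservoir[random.randrange(self.reservoir_cap)] = popped  # replace
        # otherwise: discard

    def edges(self):
        return list(self.room) + self.reservoir

wrs = WaitingRoomSample(budget=10, alpha=0.3)  # 3 newest edges + 7 sampled
for t in range(100):
    wrs.add((t, t + 1))
print(wrs.edges())
```

With this structure, the counting step would scan edges() for common neighbors of the new edge’s endpoints.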
Summary of Algorithm
• (1) Arrival Step: a new edge 𝑢 − 𝑣 arrives
• (2) Discovery Step: each triangle formed with two stored edges (e.g., (𝑢, 𝑣, 𝑥)) is discovered, and ∆ ← ∆ + 1/𝑝𝑢𝑣𝑥
• (3) Sampling Step: Waiting-Room Sampling!
Roadmap
• Overview
• Completed Work
◦ T1. Structure Analysis
▪ T1.1 Waiting-Room Sampling
◦ Temporal Pattern
◦ Algorithm
◦ Experiments <<
▪ T1.2-T1.3 Related Completed Work
◦ T2. Anomaly Detection
◦ T3. Behavior Modeling
• Proposed Work
• Conclusion
Experimental Results: Accuracy
• Datasets: …
• WRS is the most accurate (it reduces the estimation error by up to 47%)
Discovering Probability
• WRS increases the discovering probability 𝑝𝑢𝑣𝑤
• WRS discovers up to 3× more triangles than Triest-IMPR and MASCOT
Roadmap
• Overview
• Completed Work
◦ T1. Structure Analysis
▪ T1.1 Waiting-Room Sampling
▪ T1.2-T1.3 Related Completed Work <<
◦ T2. Anomaly Detection
◦ T3. Behavior Modeling
• Proposed Work
• Conclusion
T1.2 Distributed Counting of Triangles
• Goal: to utilize multiple machines for triangle counting in a graph stream
• Tri-Fly [PAKDD18]: Sources → (broadcast) → Workers → (shuffle) → Aggregators
• DiSLR [submitted to KDD]: Sources → (multicast) → Workers → (shuffle) → Aggregators

Kijung Shin, Mohammad Hammoud, Euiwoong Lee, Jinoh Oh, and Christos Faloutsos, “Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams”, PAKDD 2018
T1.2 Performance of Tri-Fly and DiSLR
• Estimation Error = Bias + Variance
[Figure: bias and variance of DiSLR vs. Tri-Fly; DiSLR improves on Tri-Fly by up to 40× and 30×; lower is better]
T1.3 Estimation of Degeneracy
• Goal: to estimate the degeneracy* in a graph stream
• Core-Triangle Pattern: a 3:1 power law between the triangle count and the degeneracy
*degeneracy: the maximum 𝑘 such that there exists a subgraph in which every node has degree at least 𝑘

Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos, “Patterns and Anomalies in k-Cores of Real-world Graphs with Applications”, KAIS 2018 (previously ICDM 2016)
T1.3 Core-D Algorithm
• Core-D: a one-pass streaming algorithm for estimating the degeneracy
• It converts an estimated triangle count ∆ (obtained by WRS, etc.) into an estimated degeneracy:
𝑑̂ = exp(𝛼 ⋅ log(∆) + 𝛽)
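A minimal sketch of the conversion, assuming the coefficients 𝛼 and 𝛽 have been fitted offline; by the 3:1 core-triangle power law, 𝛼 should be close to 1/3 (the numeric values below are placeholders, not fitted ones).

```python
import math

def estimate_degeneracy(triangle_count, alpha=1/3, beta=0.0):
    """Map an (estimated) triangle count to an estimated degeneracy
    via the log-linear relation d = exp(alpha * log(triangles) + beta)."""
    return math.exp(alpha * math.log(triangle_count) + beta)

print(estimate_degeneracy(1_000_000))  # ~100 with the placeholder coefficients
```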
Structure Analysis of Graphs
• Models: relaxed graph stream model; distributed graph stream model
• Patterns: temporal locality; core-triangle pattern
• Algorithms: WRS, Tri-Fly, DiSLR, and Core-D
• Analyses: bias and variance
Roadmap
• Overview
• Completed Work
◦ T1. Structure Analysis
◦ T2. Anomaly Detection
▪ T2.1 M-Zoom <<
▪ T2.2-T2.3 Related Completed Work
◦ T3. Behavior Modeling
• Proposed Work
• Conclusion

Kijung Shin, Bryan Hooi, and Christos Faloutsos, “Fast, Accurate and Flexible Algorithms for Dense Subtensor Mining”, TKDD 2018 (previously ECML/PKDD 2016)
Motivation: Review Fraud
[Figure: accounts (e.g., Alice, Bob, Carol) posting fraudulent restaurant reviews]
Fraud Forms a Dense Block
[Figure: a bipartite graph of accounts and restaurants and its adjacency matrix; the fraudulent reviews form a dense block]
Problem: Natural Dense Subgraphs
• The adjacency matrix contains both suspicious dense blocks (formed by fraudsters) and natural dense blocks (cores, communities, etc.)
• Question: how can we distinguish them?
Solution: Tensor Modeling
• Along the time axis:
◦ natural dense blocks are sparse (formed gradually)
◦ suspicious dense blocks are dense (synchronized behavior)
• In the tensor model, suspicious dense blocks therefore become denser than natural dense blocks
Solution: Tensor Modeling (cont.)
• High-order tensor modeling: any side information (e.g., IP address, keywords, number of stars) can be used additionally

“Given a large-scale high-order tensor, how can we find dense blocks in it?”
Problem Definition
• Given: (1) 𝑹: an 𝑁-order tensor, (2) 𝝆: a density measure, (3) 𝒌: the number of blocks we aim to find
• Find: 𝒌 distinct dense blocks maximizing 𝝆
Density Measures
• How should we define “density” (i.e., 𝜌)?
◦ there is no single absolute answer; it depends on the data, the types of anomalies, etc.
• Goal: a flexible algorithm that works well with various reasonable measures
◦ arithmetic average degree 𝜌𝐴
◦ geometric average degree 𝜌𝐺
◦ suspiciousness (KL divergence) 𝜌𝑆
◦ traditional density 𝜌𝑇(𝐵) = EntrySum(𝐵)/Vol(𝐵) is excluded, since it is maximized by a single entry with the maximum value
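For concreteness, a minimal sketch of the arithmetic average degree 𝜌𝐴 of a block, assuming a sparse tensor stored as a dict from index tuples to values; 𝜌𝐴 is the block’s mass divided by the average of its side lengths.

```python
def rho_arithmetic(tensor, block):
    """tensor: dict mapping index tuples to values (sparse N-order tensor).
    block: list of N sets, one set of kept indices per mode."""
    mass = sum(v for idx, v in tensor.items()
               if all(i in side for i, side in zip(idx, block)))
    avg_side = sum(len(side) for side in block) / len(block)
    return mass / avg_side

tensor = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 1): 6, (2, 0): 2}
print(rho_arithmetic(tensor, [{0, 1}, {0, 1}]))  # mass 18 / avg side 2 = 9.0
```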
Clarification of Blocks (Subtensors)
• The concept of blocks (subtensors) is independent of the order of rows and columns
• Entries in a block do not need to be adjacent
Roadmap
• Overview
• Completed Work
◦ T1. Structure Analysis
◦ T2. Anomaly Detection
▪ T2.1 M-Zoom [PKDD 16]
◦ Algorithm <<
◦ Experiments
▪ T2.2-T2.3 Related Completed Work
◦ T3. Behavior Modeling
• Proposed Work
• Conclusion
Single Dense Block Detection
• Greedy search, starting from the entire tensor
• Example (running density shown at each step):
5 3 0
4 6 1
2 0 0
1 0 1
𝜌 = 2.9
Single Dense Block Detection (cont.)
• Remove a slice to maximize the density 𝜌
5 3 0
4 6 1
2 0 0
𝜌 = 3
Single Dense Block Detection (cont.)
• Keep removing a slice to maximize the density 𝜌
5 3
4 6
2 0
𝜌 = 3.3 → 3.6 (after the next removal)
Single Dense Block Detection (cont.)
• Continue until all slices are removed (at the end, 𝜌 = 0)
[Figure: density over iterations]
Single Dense Block Detection (cont.)
• Output: return the densest block seen so far
5 3
4 6
2 0
𝜌 = 3.6
Speeding Up the Process
• Lemma 1 [Remove Minimum Sum First]: among the slices in the same dimension, removing the slice with the smallest entry sum increases 𝜌 most
• Example: for slice sums 12 > 9 > 2, remove the slice with sum 2 first
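A minimal sketch of the greedy search in the matrix (2-order) case, reusing the rho_arithmetic helper from the earlier sketch; following Lemma 1, each step removes the minimum-sum slice of the more favorable mode, and the densest block seen so far is returned. This is an illustration under those assumptions, not the full M-Zoom implementation, which generalizes to N modes.

```python
def densest_block_greedy(tensor, shape):
    """Greedy single-block search on a sparse 2-order tensor (dict)."""
    block = [set(range(n)) for n in shape]
    best_rho = rho_arithmetic(tensor, block)
    best_block = [set(s) for s in block]
    while sum(len(s) for s in block) > 1:
        # Entry sums of each remaining slice, per mode.
        sums = [{i: 0 for i in side} for side in block]
        for idx, v in tensor.items():
            if all(i in side for i, side in zip(idx, block)):
                for mode, i in enumerate(idx):
                    sums[mode][i] += v
        # Candidate per mode: the minimum-sum slice (Lemma 1);
        # remove whichever candidate yields the higher density.
        best = None
        for mode in range(len(block)):
            if not block[mode]:
                continue
            i = min(sums[mode], key=sums[mode].get)
            trial = [set(s) for s in block]
            trial[mode].discard(i)
            rho = rho_arithmetic(tensor, trial)
            if best is None or rho > best[0]:
                best = (rho, mode, i)
        rho, mode, i = best
        block[mode].discard(i)
        if rho > best_rho:
            best_rho, best_block = rho, [set(s) for s in block]
    return best_rho, best_block

tensor = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 1): 6, (2, 0): 2}
print(densest_block_greedy(tensor, (4, 3)))
```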
Accuracy Guarantee
• Theorem 1 [Approximation Guarantee]: the block 𝑩 returned by M-Zoom satisfies 𝜌𝐴(𝑩) ≥ (1/𝑵) ⋅ 𝜌𝐴(𝑩∗), where 𝑵 is the order of the tensor and 𝑩∗ is the densest block
• Theorem 2 [Near-linear Time Complexity]: 𝑶(𝑵𝑴 log 𝑳), where 𝑴 is the number of non-zeros and 𝑳 is the number of entries in each mode
Optional Post-Processing
• Local search: starting from the result of the previous greedy search, grow or shrink the block as long as the density increases (e.g., 𝝆 = 3.29 → 3.33 → 3.8)
• Return the block once a local maximum is reached, i.e., neither growing nor shrinking increases 𝝆
Multiple Block Detection
• Deflation: remove the entries of each found block before finding the next one (Find → Remove → Find → Remove → Find), then restore the removed entries at the end (Restore)
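A minimal sketch of the deflation loop, built on the helpers above; zeroing out the found entries in a working copy is one way to “remove” a block (the original tensor stays intact, which serves as the “restore”), though the actual bookkeeping in M-Zoom differs.

```python
def top_k_blocks(tensor, shape, k):
    """Find k dense blocks by find-remove-...-restore (deflation)."""
    work = dict(tensor)  # working copy; the original tensor is left intact
    blocks = []
    for _ in range(k):
        rho, block = densest_block_greedy(work, shape)
        blocks.append((rho, block))
        for idx in list(work):  # remove the found block's entries
            if all(i in side for i, side in zip(idx, block)):
                del work[idx]
    return blocks
```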
Roadmap
• Overview
• Completed Work
◦ T1. Structure Analysis
◦ T2. Anomaly Detection
▪ T2.1 M-Zoom [PKDD 16]
◦ Algorithm
◦ Experiments <<
▪ T2.2-T2.3 Related Completed Work
◦ T3. Behavior Modeling
• Proposed Work
• Conclusion
Speed & Accuracy
• Datasets: …
• M-Zoom offers the best speed-accuracy trade-off (up to 2-3× improvements under the density metrics 𝜌𝐺, 𝜌𝑆, and 𝜌𝐴)
Discoveries in Practice
• Korean Wikipedia (accounts × pages): 11 accounts revised 10 pages 2,305 times within 16 hours (100%)
• English Wikipedia (accounts × pages): 8 accounts revised 12 pages 2.5 million times
Discoveries in Practice (cont.)
• App Market (4-order tensor): 9 accounts gave 1 product 369 reviews with the same rating within 22 hours (100%)
• TCP Dump (7-order tensor): a block whose volume = 2 and mass = 2 million (100%)
Roadmap
• Overview
• Completed Work
◦ T1. Structure Analysis
◦ T2. Anomaly Detection
▪ T2.1 M-Zoom
▪ T2.2-T2.3 Related Completed Work <<
◦ T3. Behavior Modeling
• Proposed Work
• Conclusion
T2.2 Extension to Web-scale Tensors
• Goal: to find dense blocks in a disk-resident or distributed tensor
• D-Cube: gives the same accuracy guarantee as M-Zoom with far fewer iterations, by removing many slices at once (e.g., all slices whose entry sum is below the average)
• D-Cube processes 100B+ non-zeros in 5 hours

Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “D-Cube: Dense-Block Detection in Terabyte-Scale Tensors”, WSDM 2017
T2.3 Extension to Dynamic Tensors
• Goal: to maintain a dense block in a dynamic tensor that changes over time
• DenseStream: incrementally computes a dense block with the same accuracy guarantee as M-Zoom

Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “DenseAlert: Incremental Dense-Subtensor Detection in Tensor Streams”, KDD 2017
Anomaly Detection in Tensors
• Algorithms: M-Zoom, D-Cube, and DenseStream
• Analyses: approximation guarantees
• Discoveries: edit wars, vandalism, and bot activities; network intrusions; spam reviews
Motivation
[Figure: a new user’s progression from joining a service (“Start”) through profile setup to becoming an engaged user (“Goal”)]

Kijung Shin, Mahdi Shafiei, Myunghwan Kim, Aastha Jain, and Hema Raghavan, “Discovering Progression Stages in Trillion-Scale Behavior Logs”, WWW 2018
Problem Definition
• Given:
◦ a behavior log (users and their timestamped actions, by action type)
◦ the number of desired latent stages: 𝑘
• Find: 𝑘 progression stages, each described by
◦ the types of actions
◦ the frequency of actions
◦ the transitions to other stages
• To best describe the given behavior log
Behavior Model
• Generative process with, for each stage 𝑠:
◦ Θ𝑠: action-type distribution in stage 𝑠
◦ 𝜙𝑠: time-gap distribution in stage 𝑠
◦ 𝜓𝑠: next-stage distribution in stage 𝑠
• Constraint: “no decline” (progression but no cyclic patterns)
Optimization Algorithm
• Goal: to fit our model to the given data
◦ parameters: the distributions (i.e., {Θ𝑠, 𝜙𝑠, 𝜓𝑠}𝑠) and the latent stage assignments
• Repeat until convergence:
◦ assignment step: assign latent stages while fixing the probability distributions; the “no decline” constraint makes this solvable by dynamic programming (see the sketch below)
◦ update step: update the probability distributions while fixing the latent stages (e.g., Θ𝑠 ← the ratio of the types of actions in stage 𝑠)
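A minimal sketch of the assignment step for a single user, assuming per-action log-likelihoods score[t][s] of the t-th action under stage s are precomputed from the current distributions; because stages may only move forward (“no decline”), the best non-decreasing stage sequence is found by dynamic programming.

```python
def assign_stages(score):
    """score[t][s]: log-likelihood of the t-th action under stage s.
    Returns the non-decreasing stage sequence maximizing the total score."""
    n, k = len(score), len(score[0])
    best = [[float("-inf")] * k for _ in range(n)]  # best[t][s]: best total ending in stage s
    back = [[0] * k for _ in range(n)]
    best[0] = list(score[0])
    for t in range(1, n):
        run = float("-inf")  # running prefix max over previous stages <= s
        arg = 0
        for s in range(k):
            if best[t - 1][s] > run:
                run, arg = best[t - 1][s], s
            best[t][s] = run + score[t][s]
            back[t][s] = arg
    s = max(range(k), key=lambda j: best[n - 1][j])
    stages = [s]
    for t in range(n - 1, 0, -1):  # backtrack
        s = back[t][s]
        stages.append(s)
    return stages[::-1]

# Hypothetical scores for 4 actions and 3 stages:
print(assign_stages([[0, -9, -9], [-1, 0, -9], [-9, 0, -1], [-9, -9, 0]]))
# -> [0, 1, 1, 2]
```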
Scalability & Convergence
• Three versions of our algorithm: in-memory, out-of-core (external-memory), and distributed
• The distributed version processes 1 trillion actions in 2 hours (with 5 latent stages)
Progression of Users in LinkedIn
[Figure: discovered stages, including the onboarding process (join, build one’s profile), poking around the service, growing one’s social network (e.g., reaching 30 connections), and consuming newsfeeds]
Roadmap
• Overview
• Completed Work
◦ T1. Structure Analysis
◦ T2. Anomaly Detection
◦ T3. Behavior Modeling
• Proposed Work <<
• Conclusion
Proposed Work by Topics
• Graphs:
◦ T1. Structure Analysis: P1. Triangle Counting in Fully Dynamic Streams
◦ T3. Behavior Modeling: P3. Polarization Modeling
• Tensors:
◦ T1. Structure Analysis: P2. Fast and Scalable Tucker Decomposition
P1: Problem Definition
• Given:
◦ a fully dynamic graph stream, i.e., a list of edge insertions (+) and edge deletions (−)
◦ a memory budget 𝑘
• Estimate: the counts of global and local triangles
• To Minimize: estimation error
P1: Goal

Method        Accuracy   Handles Deletions?
Triest-FD     Lowest     Yes
MASCOT        Low        No
Triest-IMPR   High       No
WRS           Highest    No
Proposed      Highest    Yes
P2: Problem Definition
• Tucker Decomposition (a.k.a. high-order PCA)
◦ Given: an 𝑁-order input tensor 𝑿
◦ Find: 𝑁 factor matrices 𝐴(1), …, 𝐴(𝑁) and a core tensor 𝒀
◦ To satisfy: 𝑿 ≈ 𝒀 ×₁ 𝐴(1) ×₂ 𝐴(2) ⋯ ×𝑁 𝐴(𝑁), i.e., the core tensor multiplied by a factor matrix along each mode
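A minimal sketch of the approximation being sought, using only numpy; mode_n_product multiplies a tensor by a matrix along one mode, and the reconstruction applies it for every mode. The shapes and random data stand in for 𝑿, 𝒀, and the 𝐴(n).

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along `mode`:
    result[..., i, ...] = sum_j matrix[i, j] * tensor[..., j, ...]."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

I1, I2, I3 = 6, 5, 4   # input tensor shape
R1, R2, R3 = 3, 2, 2   # core tensor shape (Tucker ranks)
Y = np.random.rand(R1, R2, R3)                                  # core tensor
A = [np.random.rand(I, R) for I, R in [(I1, R1), (I2, R2), (I3, R3)]]

X_hat = Y
for mode, factor in enumerate(A):   # Y x_1 A(1) x_2 A(2) x_3 A(3)
    X_hat = mode_n_product(X_hat, factor, mode)
print(X_hat.shape)  # (6, 5, 4): an approximation of the input tensor X
```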
P2: Standard Algorithms
• Pipeline: input (large & sparse, e.g., 2GB) → materialized intermediate data (large & dense, e.g., 400GB-4TB) → SVD → output (small & dense, e.g., 2GB)
• The materialized intermediate data are the scalability bottleneck
P2: Completed Work
• Our completed work [WSDM17]: S-HOT computes the SVD on the fly, without materializing the intermediate data, but it incurs repeated computation

Jinoh Oh, Kijung Shin, Evangelos E. Papalexakis, Christos Faloutsos, and Hwanjo Yu, “S-HOT: Scalable High-Order Tucker Decomposition”, WSDM 2017
P2: Proposed Work
• Proposed algorithm: partially materialize the intermediate data, combining materialization and on-the-fly computation (input: large & sparse; intermediate data and output: small & dense)
P2: Expected Performance Gain
• Which part of the intermediate data should we materialize?
• Exploit skewed degree distributions!
[Figure: % of saved computation vs. % of materialized data]
P3. Polarization Modeling
• Polarization in social networks: division into contrasting groups
◦ e.g., “Use of marijuana should be: Legal OR Illegal”
• “How do people choose between two ways of polarization?”
◦ change of beliefs vs. change of edges
P3. Problem Definition
• Given: a time-evolving social network with nodes’ beliefs on controversial issues (e.g., legalizing marijuana)
• Find: an actor-based model with a utility function depending on network features, beliefs, etc.
• To best describe: the polarization in the data
• Applications:
◦ predicting future edges
◦ predicting the cascades of beliefs
Timeline
• Mar-May 2018: P1. Triangle counting in fully dynamic graph streams
• Jun-Aug 2018: P3. Polarization modeling
• Sep-Oct 2018: P2. Fast and scalable Tucker decomposition
• Nov 2018 - Apr 2019: thesis writing & job applications
• May 2019: defense
Roadmap
• Overview
• Completed Work
◦ T1. Structure Analysis
◦ T2. Anomaly Detection
◦ T3. Behavior Modeling
• Proposed Work
• Conclusion <<
Conclusion
• Goal: to understand large dynamic graphs and tensors
• Subtasks: structure analysis; anomaly detection; behavior modeling
• Approaches: distributed or external-memory algorithms; streaming algorithms based on sampling; approximation algorithms
References (Completed Work)
[1] Kijung Shin, Bryan Hooi, and Christos Faloutsos, “M-Zoom: Fast Dense-Block Detection in Tensors with Quality Guarantees”, ECML/PKDD 2016
[2] Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos, “CoreScope: Graph Mining Using k-Core Analysis - Patterns, Anomalies and Algorithms”, ICDM 2016
[3] Kijung Shin, “WRS: Waiting Room Sampling for Accurate Triangle Counting in Real Graph Streams”, ICDM 2017
[4] Jinoh Oh, Kijung Shin, Evangelos E. Papalexakis, Christos Faloutsos, and Hwanjo Yu, “S-HOT: Scalable High-Order Tucker Decomposition”, WSDM 2017
[5] Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “D-Cube: Dense-Block Detection in Terabyte-Scale Tensors”, WSDM 2017
[6] Kijung Shin, Euiwoong Lee, Dhivya Eswaran, and Ariel D. Procaccia, “Why You Should Charge Your Friends for Borrowing Your Stuff”, IJCAI 2017
[7] Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos, “DenseAlert: Incremental Dense-Subtensor Detection in Tensor Streams”, KDD 2017
[8] Kijung Shin, Bryan Hooi, and Christos Faloutsos, “Fast, Accurate and Flexible Algorithms for Dense Subtensor Mining”, TKDD 2018
[9] Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos, “Patterns and Anomalies in k-Cores of Real-world Graphs with Applications”, KAIS 2018
[10] Kijung Shin, Mahdi Shafiei, Myunghwan Kim, Aastha Jain, and Hema Raghavan, “Discovering Progression Stages in Trillion-Scale Behavior Logs”, WWW 2018
[11] Kijung Shin, Mohammad Hammoud, Euiwoong Lee, Jinoh Oh, and Christos Faloutsos, “Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams”, PAKDD 2018
Thank You
• Papers, software, data: http://www.cs.cmu.edu/~kijungs/proposal/
• Email: [email protected]
• Thanks to: sponsors, admins, and collaborators