Page 1
Scalable Dynamic Graph Summarization
Ioanna Tsalouchidou (1), Gianmarco De Francisci Morales (2),
Francesco Bonchi (3), Ricardo Baeza-Yates (1)
(1) Web Research Group, DTIC, Pompeu Fabra University, Spain
(2) Qatar Computing Research Institute
(3) Algorithmic Data Analytics Lab, ISI Foundation, Turin, Italy
IEEE International Conference on Big Data, 2016
Page 2
Introduction | Methodology | Experiments | Conclusions
Table of Contents
1. Introduction
   – Motivation
   – Related Work
   – Our approach
2. Methodology
   – Baseline algorithm
   – MicroClustering algorithm
3. Experiments
4. Conclusions
Page 3
Introduction to Big Graphs
Big Data in social, communication, biological networks, etc. are represented by Big Graphs
These graphs encode relationship and communication patterns between people, news, trends, proteins, etc.
Page 4
Characteristics
These graphs have some common characteristics:
Dynamic: structural and interaction evolution
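Massive: with hundreds of millions of vertices and billions of edges
[Figure: example graph on vertices A and B that gains a vertex C, with edge labels changing over time (0.3 and 0.7, then 0.5 and 0.9)]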
Page 10
Problem - Solution
Problem
Store and process big graphs and their evolution over time in main memory
Applying algorithms is computationally expensive
Solution
Aggregate vertices and edges to reduce the size
Supernode: a set of vertices of the original graph
Superedge: an edge between two supernodes
Page 11
Related Work
Graph Summarization:
GraSS: Graph structure summarization [LeFevre and Terzi, ’10]
Graph summarization with quality guarantees [Riondato et al., ’14]
Data stream clustering:
A framework for clustering evolving data streams [Aggarwal et al., ’03]
Page 13
Background: Static graph summarization
Represent graphs as adjacency matrices
Minimize the reconstruction error
Quality guarantees: geometric clustering of the nodes
[Figure: static graph and its adjacency matrix ⇒ summary graph and its summary adjacency matrix]
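A minimal sketch of the idea on this slide, assuming the partition of the vertices into supernodes is already given. Each superedge weight is taken to be the average of the corresponding block of the adjacency matrix, and the reconstruction error is taken to be the L1 distance between the original matrix and the matrix expanded from the summary. The function names, the block-average weight, and the L1 norm are assumptions made for illustration; the slides do not fix them.

```python
def summarize(adj, partition):
    """Return the k x k summary matrix of block averages.

    adj: N x N adjacency matrix as a list of lists.
    partition: list of k lists of vertex indices (the supernodes).
    """
    k = len(partition)
    summary = [[0.0] * k for _ in range(k)]
    for a, rows in enumerate(partition):
        for b, cols in enumerate(partition):
            total = sum(adj[i][j] for i in rows for j in cols)
            summary[a][b] = total / (len(rows) * len(cols))
    return summary


def reconstruction_error(adj, partition, summary):
    """L1 distance between adj and the matrix expanded from the summary."""
    owner = {v: s for s, members in enumerate(partition) for v in members}
    n = len(adj)
    return sum(
        abs(adj[i][j] - summary[owner[i]][owner[j]])
        for i in range(n)
        for j in range(n)
    )


# Toy graph: vertices 0 and 1 have identical neighbourhoods, as do 2 and 3.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
partition = [[0, 1], [2, 3]]
summary = summarize(adj, partition)
error = reconstruction_error(adj, partition, summary)
```

Grouping vertices with identical rows gives a lossless summary on this toy graph, while a partition that mixes the two groups spreads every block to 0.5 and makes every entry contribute to the error.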
Page 14
Problem formulation: Tensor summarization
Time series of w static graphs
The graph time series is represented by an adjacency tensor
Summary represented by an adjacency matrix
[Figure: adjacency tensor of size N × N × w over Node 1 ... Node N, summarized by a k × k adjacency matrix over supernodes]
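As a rough illustration of summarizing a tensor by a single matrix, the sketch below stores the time series as a list of w adjacency matrices and averages every block over both the vertices and the w time steps. The function name and the averaging rule are assumptions; the slide only states that the summary is a k × k adjacency matrix.

```python
def tensor_summary(tensor, partition):
    """Collapse a w x N x N tensor into a k x k matrix of block averages.

    tensor: list of w adjacency matrices (each N x N, list of lists).
    partition: list of k lists of vertex indices (the supernodes).
    """
    w = len(tensor)
    k = len(partition)
    summary = [[0.0] * k for _ in range(k)]
    for a, rows in enumerate(partition):
        for b, cols in enumerate(partition):
            total = sum(
                tensor[t][i][j] for t in range(w) for i in rows for j in cols
            )
            summary[a][b] = total / (w * len(rows) * len(cols))
    return summary


# Time series of w = 2 graphs on N = 3 vertices.
g0 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
g1 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
tensor = [g0, g1]
partition = [[0, 1], [2]]
summary = tensor_summary(tensor, partition)
```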
Page 15
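Problem formulation: Dynamic graph summarization via tensor streaming
[Figure: graph stream at time-stamps t0 to t5, with a sliding window of length w summarized into a k × k matrix over supernodes]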
Dynamic graph: infinite stream of static graphs
Tensor with one dimension increasing in time
Define a sliding tensor window
Summarize the tensor within the tensor window
Page 22
Problem formulation: Dynamic graph summarization via tensor streaming
At each time-stamp:
Input: most recent adjacency matrix
Update of the sliding window
Clustering nodes to supernodes
Output: one summary at every time-stamp
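The per-time-stamp loop above can be sketched as a generator over the stream. The window is kept in a bounded deque, so appending the newest adjacency matrix drops the oldest one. The names stream_summaries, one_supernode and block_average are hypothetical, and the clustering step is a placeholder that puts every vertex in one supernode; a real clustering routine would be passed in its place.

```python
from collections import deque


def stream_summaries(graph_stream, w, cluster_nodes, summarize):
    """Yield one summary per time-stamp over a sliding window of length w.

    graph_stream: iterable of N x N adjacency matrices, one per time-stamp.
    cluster_nodes: callable(window) -> partition of the vertices.
    summarize: callable(window, partition) -> summary.
    """
    window = deque(maxlen=w)  # the oldest matrix is dropped automatically
    for adj in graph_stream:
        window.append(adj)  # input and window update
        snapshot = list(window)
        partition = cluster_nodes(snapshot)  # nodes -> supernodes
        yield summarize(snapshot, partition)  # output for this time-stamp


def one_supernode(window):
    """Placeholder clustering: all vertices in a single supernode."""
    return [list(range(len(window[0])))]


def block_average(window, partition):
    """k x k matrix of averages over the window and over each block."""
    return [
        [
            sum(adj[i][j] for adj in window for i in rows for j in cols)
            / (len(window) * len(rows) * len(cols))
            for cols in partition
        ]
        for rows in partition
    ]


stream = [
    [[0, 0], [0, 0]],
    [[0, 1], [1, 0]],
    [[0, 1], [1, 0]],
]
outputs = list(stream_summaries(stream, 2, one_supernode, block_average))
```

With w = 2 the third summary no longer sees the empty first graph, which is why the average rises from 0.25 to 0.5.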
Page 23
Contributions
Introduce the problem of lossy dynamic graph summarization
Two online algorithms for summarizing dynamic, large-scale graphs
Distributed, scalable algorithms, implemented in Apache Spark
Page 24
Baseline algorithm: kC
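[Figure: the N nodes of the tensor (Node 1 ... Node N), each a data point with wN values, are clustered directly into supernodes S0 ... SC-1]
Cluster each node of the tensor to the supernodes
Each node has wN values
Clustering N points at every time-stamp
Problem: (w − 1)N² values remain unchanged between consecutive time-stamps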
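A small sketch of the baseline, assuming the clustering step is plain k-means (Lloyd's algorithm); the slides only say "clustering", so k-means, the first-k seeding, and the function names are assumptions. Each node becomes one data point by concatenating its row from each of the w matrices in the window, which gives the wN values mentioned above.

```python
def node_vectors(window):
    """One data point per node: its w rows concatenated (w * N values)."""
    n = len(window[0])
    return [[x for adj in window for x in adj[i]] for i in range(n)]


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for idx, p in enumerate(points):
            assignment[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# 4-cycle 0-1-2-3-0 observed at two time-stamps (w = 2, N = 4).
adj = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
window = [adj, adj]
points = node_vectors(window)
labels = kmeans(points, k=2)
```

Running this from scratch at every time-stamp re-reads all wN values of every node even though only the newest N × N slice of the window has changed, which is the inefficiency the last bullet points at.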
Page 25
MicroClustering algorithm: µC
[Figure: Data Points (At0 ... AtN-1) → Micro-Clusters (μC0 ... μCmC-1) → Super-nodes (S0 ... SC-1)]
Two-level clustering
Step 1: adjacency matrix to micro-clusters
Step 2: keep statistics in the micro-clusters
Step 3: run maintenance algorithm
Step 4: micro-clusters to supernodes
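Steps 1 to 3 can be sketched for a single adjacency matrix as follows, loosely in the spirit of the stream-clustering framework cited in Related Work [Aggarwal et al., ’03]. Everything specific here is an assumption for illustration: the statistics kept (count and linear sum), the radius test for absorbing a row, and the merge-closest-pair maintenance rule. Step 4 would then cluster the micro-cluster centroids into supernodes and is not shown.

```python
class MicroCluster:
    """Additive statistics of the absorbed points: count and linear sum."""

    def __init__(self, point, node):
        self.count = 1
        self.linear_sum = list(point)
        self.nodes = [node]

    def centroid(self):
        return [s / self.count for s in self.linear_sum]

    def absorb(self, point, node):
        self.count += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.nodes.append(node)

    def merge(self, other):
        self.count += other.count
        self.linear_sum = [s + o for s, o in zip(self.linear_sum, other.linear_sum)]
        self.nodes.extend(other.nodes)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_rows(adj, max_clusters, radius):
    """Rows -> micro-clusters, keeping at most max_clusters of them."""
    clusters = []
    for node, row in enumerate(adj):
        nearest = min(
            clusters, key=lambda mc: sq_dist(row, mc.centroid()), default=None
        )
        if nearest is not None and sq_dist(row, nearest.centroid()) <= radius ** 2:
            nearest.absorb(row, node)  # Step 2: update the statistics
        else:
            clusters.append(MicroCluster(row, node))
        if len(clusters) > max_clusters:  # Step 3: merge the closest pair
            pairs = [(a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]]
            a, b = min(pairs, key=lambda p: sq_dist(p[0].centroid(), p[1].centroid()))
            a.merge(b)
            clusters.remove(b)
    return clusters


adj = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
micro = assign_rows(adj, max_clusters=2, radius=0.5)
```

Because the statistics are additive, absorbing a point or merging two micro-clusters only touches sums and counts, so the second clustering level can work on a handful of centroids instead of all N rows.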
Page 30
Datasets and Experimental Setup
Datasets:
Twitter hashtag co-occurrences
Yahoo! Network Flow
Synthetic Dataset
Environment:
Cluster of 400 cores distributed in 30 machines.
Each machine: 24 cores Intel(R) Xeon(R) CPU E5-2430 0 @2.20 GHz.
Memory: driver program 12GB, executor process 3GB.
Page 31
Scalability
Page 32
Reconstruction Error
Page 33
Conclusions
Problem: Large, evolving graphs are difficult to store and process
Solution: Graph summarization reduces the size and captures the evolution of the input graph
Evaluation: Scalable, distributed solution with small reconstruction error