ALGORITHMS AND SOFTWARE FOR THE ANALYSIS OF LARGE COMPLEX NETWORKS
CHRISTIAN STAUDT
Doctoral thesis defense
Department of Informatics, Karlsruhe Institute of Technology (KIT)
OUTLINE & CONTRIBUTIONS
network science
• analysis of relational data
data analysis
• inspection
• transformation
• modeling
main algorithmic contributions
• parallel heuristics for community detection [Staudt, Meyerhenke ’16, in TPDS]
• network sparsification via edge centrality rating [Hamann, Lindner, Meyerhenke, Staudt, Wagner ’16, in SNAM]
• realistic synthetic networks via generative models [with Gutfraind, Safro, Hamann, Meyerhenke, unpublished]
software package
• NetworKit [Staudt, Sazonovs, Meyerhenke ’16, in Network Science]
• known and novel algorithms by numerous contributors
diagram: this thesis among graph algorithmics, network science, software engineering, data analysis / data science, algorithmics
INTRODUCTION
Part I
NETWORK SCIENCE
network model formation [Brandes et al. ’13]
phenomenon -> (abstraction) -> network concept -> (representation) -> network data
example: Dolphin social network study [Lusseau et al. ’03]
• observation of phenomenon
• abstraction to network concept: $N = ?$; $\forall (u, v) \in D = N \times N$, $x(u, v) = ?$
• graph representation: $V = N$, $E = \{(u, v) \in D : x(u, v) \notin \{0, \infty\}\}$, $\omega(u, v) = x(u, v)$ if $(u, v) \in E$
• network model: $G = (V, E, \omega)$, $|V| = n$, $|E| = m$
• visualization: node-edge diagram
• network analysis: network structure -> (interpretation) -> information
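The step from dyadic observations to a weighted graph can be sketched in a few lines of Python. This is a minimal illustration, not code from the thesis: `build_graph`, `null_values` and the toy observations are hypothetical names and data, and treating 0 and infinity as "no tie" values is an assumption.

```python
import math


def build_graph(actors, x, null_values=(0, math.inf)):
    """Turn dyadic observations x(u, v) into a weighted graph (V, E, omega).

    actors: iterable of entities N
    x: dict mapping pairs (u, v) to an observed value
    null_values: values that are taken to mean "no tie" (assumption)
    """
    V = set(actors)
    # An edge exists wherever the observed value is not a "no tie" value.
    E = {pair for pair, value in x.items() if value not in null_values}
    # Edge weights are simply the observed values on the retained pairs.
    omega = {pair: x[pair] for pair in E}
    return V, E, omega


observations = {("a", "b"): 3.0, ("b", "c"): 0, ("a", "c"): math.inf}
V, E, omega = build_graph(["a", "b", "c"], observations)
n, m = len(V), len(E)
```

Only the pair ("a", "b") carries a usable value here, so the toy graph has n = 3 nodes and m = 1 edge.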
NETWORKIT: A TOOL SUITE FOR THE ANALYSIS OF LARGE COMPLEX NETWORKS
Part II
NETWORKIT
contribution
• NetworKit, an open-source tool suite for the analysis of large networks [Staudt, Sazonovs, Meyerhenke ’16 in Network Science, to appear]
state of the art & targeted improvement
• various graph processing or network analysis libraries exist (e.g. Boost Graph, JUNG, NetworkX, …)
• increase performance, parallelism, focus on network science, …
• provide short path from algorithms research to data analysis applications
PRINCIPLES AND ARCHITECTURE
design goals
• performance
• usability and integration
algorithm and implementation patterns
• parallelism
• heuristics and approximation algorithms
• modular design
architecture overview
• C++ / OpenMP: Data Structures, Algorithms, I/O
• Cython: Wrapper Classes
• Python: Pythonized Classes, Task-oriented Interface, Additional Functionality
• Python shell / program: NetworKit alongside ext. Python modules (pandas, numpy, matplotlib)
USE CASES
as algorithm library
• e.g. as component in a protein network analysis pipeline [Flick ’14]
as interactive data analysis tool (e.g. via Jupyter Notebook)
• e.g. network profiles for explorative network analysis, here: connectome network [Staudt et al. ’16]
COMPARISON AND EVALUATION
comparative benchmark
• consistently fastest average processing rate for typical analysis kernels on varied set of networks
COMMUNITY DETECTION IN COMPLEX NETWORKS
Part III
COMMUNITY DETECTION
(multi-resolution) modularity [Girvan & Newman ’02] ([Lambiotte ’10])
$$M_\gamma(G, \zeta, \gamma) := \underbrace{\sum_{C \in \zeta} \frac{|E(C)|}{|E|}}_{\text{coverage}} - \gamma \cdot \underbrace{\sum_{C \in \zeta} \frac{\left(\sum_{v \in C} \deg(v)\right)^2}{(2 \cdot |E|)^2}}_{\text{expected coverage}}$$
community detection
• reveal modular composition of network by finding internally dense, externally sparse subgraphs (communities)
• community: vague concept, formalized via e.g. objective function modularity
modularity
• optimization NP-hard [Brandes et al. ’08]
• parameter-free, well understood -> commonly applied for explorative network analysis
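The modularity definition (coverage minus gamma times expected coverage) can be evaluated directly. The sketch below assumes an unweighted, undirected simple graph given as an edge list; the function name, the argument layout and the toy graph are illustrative and not taken from NetworKit.

```python
from collections import defaultdict


def modularity(edges, community, gamma=1.0):
    """Coverage minus gamma times expected coverage for a given partition.

    edges: list of undirected edges (u, v)
    community: dict mapping each node to its community id
    """
    m = len(edges)
    intra = 0                   # edges with both endpoints in one community
    volume = defaultdict(int)   # sum of degrees per community
    for u, v in edges:
        volume[community[u]] += 1
        volume[community[v]] += 1
        if community[u] == community[v]:
            intra += 1
    coverage = intra / m
    expected = sum(vol ** 2 for vol in volume.values()) / (2 * m) ** 2
    return coverage - gamma * expected


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
score = modularity(edges, two_triangles)
```

For this toy graph the coverage is 6/7 and the expected coverage is 1/2, so splitting along the bridge scores 6/7 - 1/2, whereas putting all nodes into one community scores 0.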
Dolphins social network - node color by modularity-based communities
ENGINEERING PARALLEL ALGORITHMS FOR COMMUNITY DETECTION
contribution
• two fast, robust & scalable parallel heuristics: PLP & PLM
• PLM: best quality/running time tradeoff in experimental comparison with current competitors
[Staudt & Meyerhenke ’13 at ICPP] [Staudt & Meyerhenke ’16 in TPDS]
state of the art & targeted improvement
• O(m)-time heuristics for modularity maximization exist, but few parallel algorithms (10th DIMACS Challenge 2012 [Bader et al. ’13])
• desired: implementation that robustly processes billion-edge graphs on typical multicore computer in minutes
PLM: PARALLEL LOUVAIN METHOD
predecessor
• sequential locally greedy multilevel algorithm [Blondel et al. ’08]
• optional extension: additional refinement phase (PLMR) [Rotta & Noack ’11]
our improvement
• first parallelization
phases: move phase -> coarsening phase -> prolongation (+ refinement phase)
PLM: PARALLELIZATION
parallelization issues and solutions
• lock-free parallelization where possible
• move/refinement phase
  • race conditions? yes, but mitigated by self-correcting iterative algorithm
  • locks only on update of the volume of communities
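To make the move phase concrete, here is a sequential sketch of locally greedy node moves for an unweighted simple graph. It is not the thesis implementation: the gain expression is derived from the modularity definition, and `move_phase`, `adj` and `max_rounds` are illustrative names. In a parallel variant the loop over nodes would run concurrently, with occasional stale reads of community data tolerated and corrected in later rounds.

```python
from collections import defaultdict


def move_phase(adj, gamma=1.0, max_rounds=100):
    """Greedy local moves: each node joins the neighboring community
    that yields the largest positive modularity gain."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    community = {u: u for u in adj}             # start with singletons
    volume = {u: len(adj[u]) for u in adj}      # degree sum per community
    for _ in range(max_rounds):
        moved = False
        for u in adj:
            deg_u = len(adj[u])
            current = community[u]
            # Number of edges from u into each neighboring community.
            links = defaultdict(int)
            for v in adj[u]:
                links[community[v]] += 1
            best, best_gain = current, 0.0
            for cand, w in links.items():
                if cand == current:
                    continue
                # Change in coverage plus change in expected coverage.
                gain = (w - links.get(current, 0)) / m + gamma * (
                    (volume[current] - deg_u) - volume[cand]
                ) * deg_u / (2 * m * m)
                if gain > best_gain:
                    best, best_gain = cand, gain
            if best != current:
                volume[current] -= deg_u    # the only shared state updated
                volume[best] += deg_u
                community[u] = best
                moved = True
        if not moved:
            break
    return community


# Two triangles joined by a bridge edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
community = move_phase(adj)
```

On this toy graph the moves settle on the two triangles as communities. The volume updates are the shared state that would need protection in a parallel setting.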
ENGINEERING PARALLEL ALGORITHMS FOR COMMUNITY DETECTION
experimental evaluation
• running time and solution quality (modularity) measured on diverse set of networks
• performance and scaling behavior optimized through algorithm engineering
figure: Pareto evaluation in comparison with codes from 10th DIMACS challenge [Bader et al. ’13] and others (axes: modularity score, time score)
figure: PLM: strong scaling on 3 billion edge web graph (panels: time [s] vs. threads, speedup vs. threads; 1 to 32 threads)
EDGE CENTRALITY MEASURES FOR NETWORK SPARSIFICATION
Part IV
EXAMPLES
SPARSIFICATION
(edge) sparsification
• reduce the edge set of a network while preserving important structural properties
state of the art & targeted improvement
• numerous sparsification methods proposed, but lack of comparative work and unifying concepts
contributions
• conceptual framework: network sparsification as edge centrality rating and filtering
• first comparative study of various edge centrality measures for the purpose of structure-preserving network sparsification
• an effective novel method
• (parallelized) efficient implementations
[Lindner, Staudt, Hamann, Meyerhenke, Wagner ’15 at ASONAM] [Hamann, Lindner, Meyerhenke, Staudt, Wagner ’16 in SNAM]
pipeline: $G = (V, E)$ -> edge score calculation (edge centrality measure) -> edge filtering (target edge ratio) -> $G' = (V, E')$
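The two-stage framework (edge score calculation, then edge filtering) can be sketched as follows. Triangle count per edge serves as an example score here; the function names and the toy graph are hypothetical, and the filter keeps the highest-scoring edges until the target edge ratio is reached.

```python
def triangle_counts(adj):
    """Score each undirected edge by the number of triangles it is part of."""
    scores = {}
    for u, nbrs in adj.items():
        for v in nbrs:
            if u < v:  # visit each undirected edge once
                scores[(u, v)] = len(adj[u] & adj[v])
    return scores


def filter_edges(scores, target_ratio):
    """Keep the top-scoring share of edges given by target_ratio."""
    keep = round(target_ratio * len(scores))
    # Sort by descending score, breaking ties by edge id for determinism.
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return set(ranked[:keep])


# A triangle (0, 1, 2) with one pendant edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
sparse = filter_edges(triangle_counts(adj), target_ratio=0.75)
```

Because scoring and filtering are separate functions, any other edge centrality measure can be swapped in without touching the filter.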
EVALUATION
experimental evaluation
• quantify how structural properties vary with decreasing edge ratio
• network set: >100 social graphs (Facebook and others)
conclusions
• class of methods (Simmelian Backbones, Jaccard Similarity, Algebraic Distance) that effectively preserves community structures
• novel method Local Degree preserves shortest paths, connectivity, many centralities
• local filtering improves the preservation of almost all properties (diameter, centralities, …)
GENERATIVE MODELS FOR REALISTIC SYNTHETIC NETWORKS
Part V
GENERATING SCALED REPLICAS OF REAL-WORLD NETWORKS
contributions
• fitting schemes for a variety of generative models
• experimental study clarifying the degree of realism of various models
• LFR+ generator, an effective tool for creating realistic (scaled) replicas of networks
[current joint work with Gutfraind, Safro, Meyerhenke, Hamann, unpublished]
state of the art
• large variety of generative models with varying degrees of (claimed) realism
motivation
• algorithm engineering: given a small real network, generate realistic (scaled) replicas to enable representative experiments on larger data sets
workflow: original network O -> model fitting scheme (network analysis) -> model parameters -> generator (scaling factor) -> (scaled) replica R
EXAMPLE REPLICATION
original dolphins social network
consider key structural features…
• degrees
• connectedness
• clustering
• community structure
• …
models (& generator algorithms)
• [Erdős & Rényi ’60] ([Batagelj & Brandes ’05])
• [Barabási & Albert ’02]
• [Chung et al. ’00]
• Edge-Switching Markov Chain [Milo et al. ’03]
• RMAT [Chakrabarti et al. ’04]
• Hyperbolic Unit-Disk Graph [Krioukov et al. ’10] ([Looz et al. ’15])
• BTER [Kolda et al. ’13]
• LFR [Lancichinetti et al. ’08]
LFR+ GENERATOR: EXAMPLE REPLICATION
epidemiological contact network used in HIV research [Potterat et al. ’02]
scale-2 replica produced by LFR+ generator
sample from scale-200k replica produced by LFR+ generator
SUMMARY
all contributions
• Part II
  • NetworKit [Staudt, Sazonovs, Meyerhenke ’16, in Network Science, to appear]
  • network analysis on distributed systems [Koch, Staudt, Vogel, Meyerhenke ’15 at FAB]
• Part III
  • parallel heuristics for community detection [Staudt, Meyerhenke ’13, at ICPP] [Staudt, Meyerhenke ’16, in TPDS]
  • heuristics for selective community detection [Staudt, Marrakchi, Meyerhenke ’14 at IEEE BigData]
• Part IV
  • network sparsification via edge centrality rating [Lindner, Staudt, Hamann, Meyerhenke, Wagner ’15, at ASONAM] [Hamann, Lindner, Meyerhenke, Staudt, Wagner ’16, in SNAM]
• Part V
  • realistic synthetic networks via generative models [with Gutfraind, Safro, Hamann, Meyerhenke, unpublished]
Appendix
CHARACTERIZING THE STRUCTURE OF NETWORKS
distance
• e.g. diameter, algebraic distance
node centrality
• e.g. degree, betweenness, PageRank
edge centrality
• e.g. edge betweenness, -> sparsification
partitioning
• e.g. components, k-cores, communities
correlations
• e.g. assortativity
emergent properties
• e.g. epidemic spreading
Dolphins social network - betweenness
Dolphins social network - k-core decomposition
NETWORKIT API: CODE EXAMPLE
FUNCTIONALITY
module          description                  example components
centrality      node centrality measures     centrality.Betweenness
community       community detection          community.PLM
components      connected components         components.ConnectedComponents
correlation     correlations                 correlation.Assortativity
distance        distance measures            distance.Diameter
generators      generative models            generators.ErdosRenyiGenerator
graph           graph API                    graph.Graph
linkprediction  link prediction algorithms   linkprediction.JaccardIndex
profiling       network profiling tool       profiling.Profile
simulation      simulations                  simulation.EpidemicSimulationSEIR
examples of NetworKit’s modules and components
NETWORKIT IN COMPARISON WITH DISTRIBUTED GRAPH PROCESSING FRAMEWORKS
conclusions
• good scaling can result from a framework distributing its own overheads [McSherry et al. ’15]
• distributed frameworks have significant overheads, so their application should be motivated by lack of memory on a single node
distributed (graph) computing models & frameworks
• General-Purpose
  • MapReduce: Hadoop
  • PACT: Apache Flink
• Graph-Specific
  • Vertex-Centric
    • Pregel: Apache Giraph
    • GAS: GraphLab
  • Graph-Centric: Giraph++
figure: per-network bar charts comparing NetworKit, GraphLab, Giraph, Giraph++ and Flink (x-axis: nodes, 1 to 8) on flickr-edges, livejournal-links-u, livejournal-links-d, orkut-links, uk-2002, wikipedia-links and twitter-l
community detection via label propagation
PLP: PARALLEL LABEL PROPAGATION
predecessor
• sequential label propagation algorithm [Raghavan et al. ’07]
• a local coverage maximizer (implicitly maximizes modularity by getting stuck in local optima)
our improvement
• parallelization & optimizing heuristics
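The basic label propagation scheme can be sketched sequentially as follows. This is a simplified illustration rather than the PLP implementation: ties are broken deterministically (smallest label) to keep the example reproducible, whereas label propagation is typically described with random tie-breaking, and the names `label_propagation` and `max_rounds` are illustrative.

```python
from collections import Counter


def label_propagation(adj, max_rounds=100):
    """Each node repeatedly adopts the most frequent label among its neighbors."""
    labels = {u: u for u in adj}  # every node starts with a unique label
    for _ in range(max_rounds):
        updated = 0
        for u in adj:
            if not adj[u]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[v] for v in adj[u])
            best = max(counts.values())
            # Keep the current label if it is already among the most frequent.
            if counts.get(labels[u], 0) == best:
                continue
            labels[u] = min(l for l, c in counts.items() if c == best)
            updated += 1
        if updated == 0:  # stable: no label changed in this round
            break
    return labels


# Two separate triangles end up with one label each.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation(adj)
```

The loop stops once every node carries a label that is most frequent in its neighborhood, which is the local optimum referred to above.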
SELECTIVE COMMUNITY DETECTION
task
- given a set of seed nodes, detect the communities that contain them
state of the art & targeted improvement
- plethora of objective functions and heuristics proposed
- lack of comparative work
contributions
- comparative study clarifying the real-world performance of existing and novel algorithms
- Greedy Community Expansion: generic greedy algorithm for the incremental maximization of different objective functions, subsuming several previous efforts
- SelSCAN: application of density-based clustering to selective community detection (SCD)
figure: seed nodes s1, s2, s3
figure: density-based community
EDGE CENTRALITY MEASURES FOR SPARSIFICATION
methods
- Random Edge (RE)
- Triangle Count (Tri)
- Jaccard Similarity (JS) [Satuluri et al. ’11]
- (Quadrilateral or Triadic) Simmelian Backbones (TS, QLS) [Nick et al. ’13] [Nocaj et al. ’14]
- Edge Forest Fire (EFF) [Leskovec et al. ’06]
- Algebraic Distance (AD) [Chen & Safro ’11]
- Local Degree (LD)
edge rank correlations
EDGE CENTRALITY: LOCAL DEGREE
motivation
• hierarchy of hub nodes (with high degree) is important for real-world network’s connectivity (esp. shortest paths)
figure source: [http://rocs.northwestern.edu/]
measure
• retain edges to top $\lfloor \deg(u)^\alpha \rfloor$ neighbors by degree
• time: $O(m \log(d_{\max}))$
experimental results
• preserves wide range of properties (connectivity, shortest paths, centralities, epidemic simulation behavior)
• can be strongly correlated with edge betweenness, depending on the network
• (edge betweenness: $O(nm)$ time)
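The Local Degree rule can be sketched directly from its description: every node keeps the edges to its highest-degree neighbors, and an edge survives if either endpoint keeps it. This is a minimal sketch under that reading; the function name, the tie-breaking by node id and the toy graph are assumptions made for illustration.

```python
import math


def local_degree_sparsify(adj, alpha):
    """Keep, for each node u, the edges to its top floor(deg(u)^alpha)
    neighbors ranked by degree (ties broken by node id)."""
    kept = set()
    for u, nbrs in adj.items():
        k = math.floor(len(nbrs) ** alpha)
        # Sorting the neighbors dominates the cost: O(deg(u) log deg(u)).
        top = sorted(nbrs, key=lambda v: (-len(adj[v]), v))[:k]
        for v in top:
            kept.add((min(u, v), max(u, v)))
    return kept


# Hub 0 with four neighbors, plus one edge between neighbors 1 and 2.
adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
backbone = local_degree_sparsify(adj, alpha=0.5)
```

With alpha = 0.5 every node points at the hub, so the four hub edges survive and the edge between nodes 1 and 2 is dropped; with alpha = 1 all edges are kept.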
SPARSIFICATION BY LOCAL FILTERING
pipeline: $G = (V, E)$ -> edge score calculation (edge centrality measure), scores $c(\{u, v\})$ -> score transformation, scores $lc(\{u, v\})$ -> edge filtering (target edge ratio) -> $G' = (V, E')$
score transformation
• keep top $\lfloor \deg(u)^\alpha \rfloor$, $\alpha \in [0, 1]$, edges incident to every node…
• … by applying global threshold to transformed scores $lc(\{u, v\})$
advantages
• improves preservation of all properties (by preserving connectivity)
• accommodates structurally heterogeneous regions in networks
figure: Quadrilateral Simmelian Backbone (QLS) [Nocaj et al. ’14]
figure: Quadrilateral Simmelian Backbone with local filtering (LQLS)
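One way to turn the "keep top edges per node" rule into a global threshold is sketched below. An edge ranked r-th among the deg(u) edges at node u is kept for every alpha with r <= deg(u)^alpha, that is for alpha >= log(r) / log(deg(u)). Taking one minus this value, maximized over both endpoints, gives a transformed score that can be thresholded at 1 - alpha. This derivation and the names `local_scores` and `filter_edges` are illustrative assumptions; the exact transformation used in the thesis may differ.

```python
import math


def local_scores(adj, score):
    """Transform edge scores c into local scores lc in [0, 1]."""
    lc = {}
    for u, nbrs in adj.items():
        d = len(nbrs)
        # Rank incident edges by descending score (ties broken by node id).
        ranked = sorted(nbrs, key=lambda v: (-score[frozenset((u, v))], v))
        for rank, v in enumerate(ranked, start=1):
            # Smallest alpha at which u keeps this edge.
            min_alpha = math.log(rank) / math.log(d) if d > 1 else 0.0
            e = frozenset((u, v))
            # An edge is kept if either endpoint keeps it.
            lc[e] = max(lc.get(e, 0.0), 1.0 - min_alpha)
    return lc


def filter_edges(lc, alpha):
    """Global threshold on the transformed scores."""
    return {e for e, s in lc.items() if s >= 1.0 - alpha}


adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
score = {frozenset(e): s for e, s in
         {(0, 1): 1, (0, 2): 2, (0, 3): 3, (0, 4): 4, (1, 2): 5}.items()}
lc = local_scores(adj, score)
sparse = filter_edges(lc, alpha=0.0)
```

Even at alpha = 0 every node retains its best incident edge, which illustrates why local filtering tends to keep the graph from fragmenting.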
REALISTIC SCALING OF NETWORKS
LFR+ ALGORITHM
predecessor
• LFR generator for community detection benchmarking [Lancichinetti, Fortunato, Radicchi ’08]
our modification
• utilize core algorithm but accept more general input, fitted to original graph
• -> increased flexibility and realism of the replica
vanilla LFR parameters
• degree distribution with power-law exponent $\gamma$
• community size distribution with power-law exponent $\beta$
• mixing parameter $\mu$
LFR+ parameters (fitted to original graph O)
• degrees $d = (\deg(u))_{u \in V}$
• communities $\zeta$
• intra-community degrees, one per node $u \in V$
core algorithm
1. assign degrees to nodes
2. assign sizes to communities
3. assign nodes randomly & iteratively to communities so that intra-community degrees are satisfied
4. connect graph using Edge-Switching Markov Chain Generator model (one graph per community, one global graph, rewiring step to remove additional intra-community edges)
output: synthetic graph R
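The edge-switching idea used in step 4 can be illustrated in isolation: two edges are picked at random and their endpoints are exchanged, which randomizes the graph while leaving every node degree unchanged. This is a generic sketch of edge switching on a simple graph, not the LFR+ generator itself; `edge_switching`, `swaps` and `seed` are illustrative names.

```python
import random
from collections import Counter


def edge_switching(edges, swaps, seed=0):
    """Randomize a simple undirected graph while preserving all degrees."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    for _ in range(swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose one of the two possible rewirings
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create self-loops or multi-edges.
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
    return edge_list


def degrees(edges):
    """Degree of every node in an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


ring = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (2, 6)]
shuffled = edge_switching(ring, swaps=100)
```

Whatever swaps are accepted, the degree of each node and the number of edges are the same before and after.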
STRUCTURE REPLICATION
RUNNING TIME REPLICATION
Are algorithm running times obtained on synthetic graphs representative of those on real-world inputs?