Parallel Community Detection for Massive Graphs
E. Jason Riedy, Henning Meyerhenke, David Ediger, and David A. Bader
14 February 2012
Exascale data analysis
Health care: Finding outbreaks, population epidemiology
Social networks: Advertising, searching, grouping
Intelligence: Decisions at scale, regulating algorithms
Systems biology: Understanding interactions, drug design
Power grid: Disruptions, conservation
Simulation: Discrete events, cracking meshes
• Graph clustering is common in all application areas.
10th DIMACS Impl. Challenge—Parallel Community Detection—Jason Riedy 2/35
These are not easy graphs.
Yifan Hu’s (AT&T) visualization of the in-2004 data set
http://www2.research.att.com/~yifanhu/gallery.html
But no shortage of structure...
Protein interactions: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003.
Jason’s network via LinkedIn Labs
• Locally, there are clusters or communities.
• First pass over a massive social graph:
  • Find smaller communities of interest.
  • Analyze / visualize top-ranked communities.
• Our part: Community detection at massive scale. (Or kinda large, given available data.)
Outline
Motivation
Shooting for massive graphs
Our parallel method
Implementation and platform details
Performance
Conclusions and plans
Can we tackle massive graphs now?
Parallel, of course...
• Massive needs distributed memory, right?
• Well... Not really. Can buy a 2 TiB Intel-based Dell server on-line for around $200k USD, a 1.5 TiB from IBM, etc.
Image: dell.com.
Not an endorsement, just evidence!
• Publicly available “real-world” data fits...
• Start with shared memory to see what needs to be done.
• Specialized architectures provide larger shared-memory views over distributed implementations (e.g. Cray XMT).
Designing for parallel algorithms
What should we avoid in algorithms?
Rules of thumb:
• “We order the vertices (or edges) by...” unless followed by bisecting searches.
• “We look at a region of size more than two steps...” Many target massive graphs have diameter of ≈ 20. More than two steps swallows much of the graph.
• “Our algorithm requires more than O(|E|/#)...” Massive means you hit asymptotic bounds, and |E| is plenty of work.
• “For each vertex, we do something sequential...” The few high-degree vertices will be large bottlenecks.
Remember: Rules of thumb can be broken with reason.
Designing for parallel implementations
What should we avoid in implementations?
Rules of thumb:
• Scattered memory accesses through traditional sparse matrix representations like CSR. Use your cache lines.
  Separate arrays: idx: 32b | idx: 32b | ...  plus  val: 64b | val: 64b | ...
  Packed records:  idx1: 32b | idx2: 32b | val1: 64b | val2: 64b | ...
• Using too much memory, which is a painful trade-off with parallelism. Think Fortran and workspace...
• Synchronizing too often. There will be work imbalance; try to use the imbalance to reduce “hot-spotting” on locks or cache lines.
Remember: Rules of thumb can be broken with reason. Some of these help when extending to PGAS / message-passing.
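The cache-line rule can be made concrete. A minimal sketch contrasting the two layouts on the slide, scattered index/value arrays versus packed records (type and field names are illustrative, not from the actual code):

```c
#include <stdint.h>

/* Scattered layout: indices and values live in separate arrays, so
   reading one edge touches at least two distinct cache lines. */
struct edges_scattered {
    int32_t *idx;   /* idx: 32b | idx: 32b | ... */
    double  *val;   /* val: 64b | val: 64b | ... */
};

/* Packed layout: both 32-bit indices and the 64-bit value of an edge
   sit in one 16-byte record, so one cache line holds whole edges. */
struct edge_packed {
    int32_t idx1, idx2;   /* idx1: 32b | idx2: 32b */
    double  val;          /* val: 64b              */
};
```

With 64-byte cache lines, four packed records fit per line, and a scan over edges streams through memory instead of bouncing between two arrays.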
Sequential agglomerative method
[Figure: example graph on vertices A-G, contracted one edge at a time]
• A common method (e.g. Clauset, Newman, & Moore) agglomerates vertices into communities.
• Each vertex begins in its own community.
• An edge is chosen to contract.
  • Merging maximally increases modularity.
  • Priority queue.
• Known often to fall into an O(n²) performance trap with modularity (Wakita & Tsurumi).
Parallel agglomerative method
[Figure: same example graph; all matched edges are contracted at once]
• We use a matching to avoid the queue.
• Compute a heavy-weight, large matching.
  • Simple greedy algorithm.
  • Maximal matching.
  • Within a factor of 2 in weight.
• Merge all communities at once.
• Maintains some balance.
• Produces different results.
• Agnostic to weighting, matching...
  • Can maximize modularity, minimize conductance.
  • Modifying the matching permits easy exploration.
Platform: Cray XMT2
Tolerates latency by massive multithreading.
• Hardware: 128 threads per processor
  • Context switch on every cycle (500 MHz)
  • Many outstanding memory requests (180/proc)
  • “No” caches...
• Flexibly supports dynamic load balancing
  • Globally hashed address space, no data cache
• Support for fine-grained, word-level synchronization
  • Full/empty bit with every memory word
• 64-processor XMT2 at CSCS, the Swiss National Supercomputer Centre.
• 500 MHz processors, 8192 threads, 2 TiB of shared memory
Image: cray.com
Platform: Intel® E7-8870-based server
Tolerates some latency by hyperthreading.
• Hardware: 2 threads / core, 10 cores / socket, four sockets.
  • Fast cores (2.4 GHz), fast memory (1 066 MHz).
  • Not so many outstanding memory requests (60/socket), but large caches (30 MiB L3 per socket).
• Good system support
  • Transparent hugepages reduce TLB costs.
  • Fast, user-level locking. (HLE would be better...)
  • OpenMP, although I didn’t tune it...
• mirasol, #17 on Graph500 (thanks to UCB)
• Four processors (80 threads), 256 GiB memory
• gcc 4.6.1, Linux kernel 3.2.0-rc5
Image: Intel® press kit
Implementation: Data structures
Extremely basic for graph G = (V,E)
• An array of (i, j; w) weighted edge pairs, each i, j stored only once and packed, uses 3|E| space
• An array to store self-edges, d(i) = w, in |V| space
• A temporary floating-point array for scores, |E|
• Additional temporary arrays using 4|V| + 2|E| to store degrees, matching choices, offsets...
• Weights count the number of agglomerated vertices or edges.
• Scoring methods (modularity, conductance) need only vertex-local counts.
• Storing the undirected graph in a symmetric manner reduces memory usage drastically and works with our simple matcher.
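A minimal C sketch of this layout (the type and function names are illustrative, not from the actual code):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the storage above: each undirected edge {i, j} appears
   once with its weight w, self-edge weights d(i) = w live in their
   own array, and scoring gets a temporary array of |E| doubles.
   Weights start at 1 and accumulate as vertices and edges merge. */
typedef struct {
    int64_t nv, ne;     /* |V|, |E|                               */
    int32_t *el;        /* 3|E| entries: i, j, w packed per edge  */
    int32_t *d;         /* |V| self-edge weights                  */
    double  *score;     /* |E| temporary per-edge scores          */
} comm_graph;

comm_graph *comm_graph_alloc(int64_t nv, int64_t ne) {
    comm_graph *g = malloc(sizeof *g);
    g->nv = nv;
    g->ne = ne;
    g->el = malloc(3 * ne * sizeof *g->el);
    g->d = calloc(nv, sizeof *g->d);   /* self-edges start at zero */
    g->score = malloc(ne * sizeof *g->score);
    return g;
}
```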
Implementation: Data structures
Extremely basic for graph G = (V,E)
• An array of (i, j; w) weighted edge pairs, each i, j stored only once and packed, uses 3|E| 32-bit space
• An array to store self-edges, d(i) = w, in |V| space
• A temporary floating-point array for scores, |E|
• Additional temporary arrays using 2|V| + |E| 64-bit and 2|V| 32-bit entries to store degrees, matching choices, offsets...
• Need to fit uk-2007-05 into 256 GiB.
• Cheat: Use 32-bit integers for indices. Know we won’t contract so far as to need 64-bit weights.
• Could cheat further and use 32-bit floats for scores.
• (Note: Code didn’t bother optimizing workspace size.)
Implementation: Data structures
Extremely basic for graph G = (V,E)
• An array of (i, j; w) weighted edge pairs, each i, j stored only once and packed, uses 3|E| space
• An array to store self-edges, d(i) = w, in |V| space
• A temporary floating-point array for scores, |E|
• Additional temporary arrays using 2|V| + |E| 64-bit and 2|V| 32-bit entries to store degrees, matching choices, offsets...
• Original code ignored order in the edge array, which killed OpenMP performance.
• New: Roughly bucket the edge array by first stored index, a non-adjacent CSR-like structure.
• New: Hash i, j to determine order. Scatter among buckets.
• (New = MTAAP 2012)
Implementation: Routines
Three primitives: Scoring, matching, contracting
Scoring Trivial.
Matching Repeat until there is no ready, unmatched vertex:
1 For each unmatched vertex in parallel, find the best unmatched neighbor in its bucket.
2 Try to point the remote match at that edge (lock, check if best, unlock).
3 If pointing succeeded, try to point the self-match at that edge.
4 If both succeeded, yeah! If not and there was some eligible neighbor, re-add self to the ready, unmatched list.
(Possibly too simple, but...)
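The parallel matcher above races vertices for their locally best neighbor instead of ordering edges globally. A simpler sequential sketch of the same greedy idea, using a sort by weight that the parallel code deliberately avoids (names are illustrative):

```c
#include <stdlib.h>

typedef struct { int i, j; double w; } edge;

/* Sort edges by decreasing weight. */
static int by_weight_desc(const void *a, const void *b) {
    double wa = ((const edge *)a)->w, wb = ((const edge *)b)->w;
    return (wa < wb) - (wa > wb);
}

/* Greedy heavy matching: take each edge, heaviest first, whose two
   endpoints are both still unmatched.  The result is maximal and
   within a factor of 2 of the maximum-weight matching.
   match[v] = matched neighbor of v, or -1.  Returns # of pairs. */
int greedy_matching(edge *el, int ne, int nv, int *match) {
    for (int v = 0; v < nv; ++v) match[v] = -1;
    qsort(el, ne, sizeof *el, by_weight_desc);
    int npairs = 0;
    for (int e = 0; e < ne; ++e) {
        int i = el[e].i, j = el[e].j;
        if (i != j && match[i] < 0 && match[j] < 0) {
            match[i] = j;
            match[j] = i;
            ++npairs;
        }
    }
    return npairs;
}
```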
Implementation: Routines
Contracting
1 Map each i, j to new vertices, re-order by hashing.
2 Accumulate counts for the new i′ bins, prefix-sum for offsets.
3 Copy into the new bins.
• Only synchronizing in the prefix sum. That could be removed if I don’t re-order the i′, j′ pairs; haven’t timed the difference.
• Actually, the current code copies twice... On the short list for fixing.
• Binning, as opposed to the original list-chasing, enabled Intel/OpenMP support with reasonable performance.
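Steps 2 and 3 are the classic count / prefix-sum / copy bucketing pattern. A sequential sketch (the parallel version runs the count and copy loops in parallel and synchronizes only around the prefix sum; names are illustrative):

```c
#include <stdlib.h>

/* Bucket ne items by key (0 <= key[e] < nbins):
   off[b]..off[b+1] delimits bin b; perm lists item indices in
   bin order. */
void bucket_by_key(const int *key, int ne, int nbins,
                   int *off /* nbins+1 */, int *perm /* ne */) {
    for (int b = 0; b <= nbins; ++b) off[b] = 0;
    /* 1. Accumulate counts per bin (shifted by one slot). */
    for (int e = 0; e < ne; ++e) ++off[key[e] + 1];
    /* 2. Prefix-sum the counts into starting offsets. */
    for (int b = 1; b <= nbins; ++b) off[b] += off[b - 1];
    /* 3. Copy: place each item at its bin's next free slot. */
    int *next = malloc(nbins * sizeof *next);
    for (int b = 0; b < nbins; ++b) next[b] = off[b];
    for (int e = 0; e < ne; ++e) perm[next[key[e]]++] = e;
    free(next);
}
```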
Implementation: Routines
[Figure: fraction of execution time spent in each primitive (score, match, contract, other) for each benchmark graph, on the Intel E7-8870 and Cray XMT2]
Performance: Time by platform
[Figure: execution time in seconds for each benchmark graph on the Intel E7-8870 and Cray XMT2; points colored by number of communities, 10^0 to 10^6]
Performance: Rate by platform
[Figure: edges per second for each benchmark graph on the Intel E7-8870 and Cray XMT2; points colored by number of communities, 10^0 to 10^6]
Performance: Rate by metric (on Intel)
[Figure: edges per second for each benchmark graph on the Intel E7-8870, comparing the cnm and cond scoring metrics; points colored by number of communities, 10^0 to 10^6]
Performance: Scaling
[Figure: strong scaling, time vs. number of threads/processors (2^0 to 2^6) for uk-2002 and kron_g500-simple-logn20. Annotated endpoints: Intel E7-8870 from 368.0 s and 84.9 s down to 33.4 s and 6.6 s; Cray XMT2 from 1188.9 s and 349.6 s down to 285.4 s and 72.1 s]
Performance: Modularity at coverage ≈ 0.5
[Figure: modularity for each benchmark graph under the cnm, mb, and cond scoring methods, measured at coverage ≈ 0.5]
Performance: Avg. conductance at coverage ≈ 0.5
[Figure: average inter-cluster conductance (AIXC) for each benchmark graph under the cnm, mb, and cond scoring methods, at coverage ≈ 0.5]
Performance: Modularity by step
[Figure: modularity vs. agglomeration step (0-35) for coAuthorsCiteseer, eu-2005, and uk-2002]
Performance: Coverage by step
[Figure: coverage vs. agglomeration step (0-35) for coAuthorsCiteseer, eu-2005, and uk-2002]
Performance: # of communities
[Figure: number of communities (10^4 to 10^7) vs. agglomeration step (0-35) for coAuthorsCiteseer, eu-2005, and uk-2002]
Performance: AIXC by step
[Figure: AIXC vs. agglomeration step (0-35) for coAuthorsCiteseer, eu-2005, and uk-2002]
Performance: Comm. volume by step
[Figure: communication volume (cut weight dashed) vs. agglomeration step (0-35) for coAuthorsCiteseer, eu-2005, and uk-2002]
Conclusions and plans
• Code: http://www.cc.gatech.edu/~jriedy/community-detection/
• First: Fix the low-hanging fruit.
  • Eliminate a copy during contraction.
  • Deal with stars (next presentation).
• Then... Practical experiments.
  • How volatile are modularity and conductance to perturbations?
  • What matching schemes work well?
  • How do different metrics compare in applications?
• Extending to streaming graph data!
  • Includes developing parallel refinement... (distance-2 matching)
  • And possibly de-clustering or manipulating the dendrogram.