Ongoing trend: expansion of the memory hierarchy for increased CPU throughput – e.g., the high-bandwidth memory (HBM) layer on current-generation Intel Xeon Phis (Knights Landing)
Can we explicitly design graph computations to effectively utilize this layer?
We explore a work-chunking approach that iteratively brings in pieces of a large graph to perform local updates in HBM – we specifically look at the label propagation algorithm. We find:
Chunking has minimal impact on solution quality
Chunking can also decrease time to solution
Primary assumption: the graphs being processed are too large to fit entirely within MCDRAM
2 / 20
Intel Knights Landing (KNL)
68-72 cores with High Bandwidth Multi-channel DRAM (MCDRAM)
Source: Intel
Stream Triad Bandwidths
DDR: 90 GB/s
MCDRAM: 450 GB/s
Multiple MCDRAM modes
Cache Mode
Flat Mode
Hybrid Mode
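In flat mode, MCDRAM appears as a separately addressable memory that the programmer must target explicitly (cache mode manages it transparently; hybrid splits it). A minimal sketch of an explicit allocator, assuming the memkind library's hbw_malloc interface and falling back to ordinary DDR malloc when the library is unavailable (the USE_HBW guard is a hypothetical build flag, not from the slides):

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#ifdef USE_HBW
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory interface */
#endif

/* Allocate from MCDRAM when built with HBM support, else from DDR. */
void *hbm_alloc(size_t bytes) {
#ifdef USE_HBW
    return hbw_malloc(bytes);   /* placed in MCDRAM (flat mode) */
#else
    return malloc(bytes);       /* DDR fallback */
#endif
}

void hbm_free(void *p) {
#ifdef USE_HBW
    hbw_free(p);
#else
    free(p);
#endif
}
```

Keeping the HBM calls behind one tiny wrapper lets the same chunking code run on machines without MCDRAM.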
3 / 20
Label Propagation
Randomly label with n = #verts labels
Iteratively update each v ∈ V (G) with max per-label count over neighbors with ties broken randomly
Algorithm completes when no new updates are possible; in large graphs, a fixed iteration count is used instead
4 / 20
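The per-vertex update rule above can be sketched as follows, assuming a CSR (compressed sparse row) adjacency layout; the dense `counts` array stands in for the hash table a real implementation would use, and ties are broken randomly with reservoir-style selection:

```c
#include <stdlib.h>
#include <assert.h>

/* One label-propagation update for vertex v: count the labels of v's
 * neighbors and return the most frequent one, breaking ties randomly.
 * adj_start/adj form a CSR adjacency; labels holds current labels,
 * all in the range [0, n_labels). Sketch only, not the paper's code. */
int propagate_label(const int *adj_start, const int *adj, const int *labels,
                    int v, int n_labels, unsigned *seed) {
    int *counts = calloc(n_labels, sizeof(int));  /* dense for clarity */
    for (int e = adj_start[v]; e < adj_start[v + 1]; e++)
        counts[labels[adj[e]]]++;

    int best = labels[v], best_count = 0, ties = 0;
    for (int lab = 0; lab < n_labels; lab++) {
        if (counts[lab] > best_count) {
            best_count = counts[lab]; best = lab; ties = 1;
        } else if (counts[lab] == best_count && counts[lab] > 0) {
            ties++;                                /* tied max so far */
            if (rand_r(seed) % ties == 0) best = lab;  /* pick uniformly */
        }
    }
    free(counts);
    return best;   /* isolated vertices keep their own label */
}
```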
Why Label Propagation?
Iterative vertex updates – prototypical of many other graph computations
Wide usage – community detection, partitioning, other unsupervised learning problems
Nondeterministic algorithm by design – solution quality can vary based on processing methodology
Straightforward to implement via work chunking
5 / 20
Multilevel Memory Label Propagation
via work chunking
1: L ← LPChunking(G(V,E), Cnum, Citer)
2:   for all v ∈ V : L(v) ← id(v)                ▷ Initialize labels as vertex ids
3:   while at least one L(v) updates do
4:     for c = 0 ··· (Cnum − 1) do
5:       Vc ← Chunk(c, V), Ec ← {⟨v, u⟩ ∈ E : v or u ∈ Vc}
6:       for iter = 1 . . . Citer while some L(v) : v ∈ Vc updates do
7:         for all v ∈ Vc do in parallel          ▷ Random order
8:           Counts ← ∅                           ▷ Hash table
9:           for all ⟨v, u⟩ ∈ Ec do
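The chunked driver above can be sketched sequentially as follows. This is a simplified, deterministic sketch (lowest label wins ties, vertices visited in order, rather than the randomized order/ties of the pseudocode), with chunks taken as contiguous vertex blocks; the point where a chunk would be staged into MCDRAM is only marked by a comment:

```c
#include <stdlib.h>

/* Chunked label propagation over a CSR graph with n vertices.
 * Labels are initialized to vertex ids; each of the c_num chunks is
 * iterated up to c_iter times before moving on, and the outer loop
 * repeats until a full sweep produces no updates. */
void lp_chunking(int n, const int *adj_start, const int *adj,
                 int *L, int c_num, int c_iter) {
    for (int v = 0; v < n; v++) L[v] = v;        /* labels = vertex ids */
    int chunk_sz = (n + c_num - 1) / c_num;
    int *counts = calloc(n, sizeof(int));        /* dense per-label counts */
    int global_updates = 1;
    while (global_updates) {
        global_updates = 0;
        for (int c = 0; c < c_num; c++) {
            int lo = c * chunk_sz;
            int hi = lo + chunk_sz < n ? lo + chunk_sz : n;
            /* <-- stage chunk c's vertices/edges into HBM here */
            for (int it = 0; it < c_iter; it++) {
                int updates = 0;
                for (int v = lo; v < hi; v++) {
                    int best = L[v], best_count = 0;
                    for (int e = adj_start[v]; e < adj_start[v + 1]; e++) {
                        int lab = L[adj[e]];
                        if (++counts[lab] > best_count ||
                            (counts[lab] == best_count && lab < best)) {
                            best_count = counts[lab];
                            best = lab;
                        }
                    }
                    for (int e = adj_start[v]; e < adj_start[v + 1]; e++)
                        counts[L[adj[e]]] = 0;   /* reset only touched slots */
                    if (best != L[v]) { L[v] = best; updates = 1; }
                }
                global_updates |= updates;
                if (!updates) break;             /* chunk converged early */
            }
        }
    }
    free(counts);
}
```

Resetting only the counter slots touched by each vertex keeps the per-vertex cost proportional to its degree rather than to n.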
Vary the number of work/transfer threads to ensure balance
7 / 20
Algorithmic Variants
Baseline Cache
Baseline implementation running in cache mode
Baseline Hybrid
Baseline implementation with hash table allocated in MCDRAM
Graph structure and other data handled by MCDRAM cache
Chunk-HBM
All data explicitly allocated in MCDRAM
Per-chunk graph structure transferred into MCDRAM
All vertex labels static in MCDRAM
8 / 20
Experimental Setup
Test system and test graphs
Test System: Bowman at Sandia Labs – each node has aKNL with 68 cores, 96 GB DDR, and 16 GB MCDRAM
Test Graphs:
Network      n      m      davg   dmax   D̃
LiveJournal  4.8 M  69 M   18     20 K   18
Friendster   66 M   1.8 B  27     5.2 K  34
Twitter      52 M   2.0 B  37     3.7 M  19
Host         89 M   2.0 B  22     3.4 M  23
uk-2007      105 M  3.3 B  31     975 K  82
wBTER_50     50 M   1.2 B  24     110 K  12
wBTER_100    100 M  2.4 B  24     135 K  12
9 / 20
How does chunking impact solution quality?
10 / 20
Convergence and solution quality
For label propagation and community detection algorithms in general
Defining convergence
True convergence: no more label updates can occur
Looser criteria: fixed iterations, some modularity gain or change, number of labels, others
We run to true convergence when possible, but fix iterations toenable a parametric study of chunking variables.
Defining solution quality
Standard metrics when no ground truth exists: modularity, conductance, among many others
When ground truth exists: normalized mutual information (NMI) and related measurements
Despite some observed flaws with their usage, we select thestandard measurements of modularity and NMI.
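Modularity can be computed directly from an edge list using the community-wise form Q = Σc (ec/m − (dc/2m)²), where ec is the number of intra-community edges and dc the total degree of community c; this is algebraically equal to Newman's Q = (1/2m) Σij [Aij − kikj/2m] δ(ci, cj). A small sketch (not the paper's code):

```c
#include <stdlib.h>
#include <math.h>
#include <assert.h>

/* Modularity Q of an undirected graph given as m edges (pairs of
 * vertex ids in [0, n)) and a community assignment comm[v] in [0, n).
 * Uses the community-wise form Q = sum_c (e_c/m - (d_c/(2m))^2). */
double modularity(int n, int m, const int (*edges)[2], const int *comm) {
    double *intra = calloc(n, sizeof(double));  /* e_c: intra-community edges */
    double *deg   = calloc(n, sizeof(double));  /* d_c: total degree of c */
    for (int e = 0; e < m; e++) {
        int u = edges[e][0], v = edges[e][1];
        deg[comm[u]] += 1.0;
        deg[comm[v]] += 1.0;
        if (comm[u] == comm[v]) intra[comm[u]] += 1.0;
    }
    double q = 0.0;
    for (int c = 0; c < n; c++) {
        double frac = deg[c] / (2.0 * m);
        q += intra[c] / m - frac * frac;
    }
    free(intra);
    free(deg);
    return q;
}
```

For two triangles joined by a single bridge edge, with each triangle its own community, this gives Q = 2·(3/7 − (7/14)²) = 5/14.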
11 / 20
Chunking parameters
Evaluating impact of number of chunks and iterations per chunk
Heatmaps of iterations to convergence (left) and impact on final modularity (right) – lighter is better
About a 5× increase in iterations captured in the left plot and 2% total modularity change in the right plot
While chunking increases iterations to convergence, it has minimal impact on final solution quality (and actually improves it in several instances – LiveJournal, Host, wBTER)
Ran the same parametric tests on the LFR benchmark (n = 10,000, k = 15, maxk = 500, t1 = 2, t2 = 1, µ = 0.05 . . . 0.6)
Heatmap of iterations to convergence (left) and NMI versus baseline (right)
Similar takeaways to the real-world test instances
[Figures: heatmap with axes Number of Chunks (2–50) × Iter Per Chunk (1–50); line plot of NMI (0.980–1.000) vs. Mixing Parameter 'mu' (0.1–0.6) for Base, Chunk_5_5, Chunk_50_50]
13 / 20
Can HBM chunking improve time to solution?
14 / 20
Consideration 1: Partitioning Methodology
5 iterations per chunk, minimum number of chunks possible (∼5), 40 iterations
Effects of partitioning method on per-iteration speedup vs. baseline (left) and modularity (right)
Explicit partitioning demonstrates the largest improvements, but at the obvious cost of computing the partition
[Figures: bar charts of Per-iter Speedup (0.0–2.0) and Modularity Improvement (0.00–1.00) by Partitioning Strategy: PuLP, VertBlock, EdgeBlock, Random]
15 / 20
Consideration 2: Overlapping Communication
Average speedups across all partitioning methods while overlapping communication
Note: when overlapping, we double the number of chunks; this can lead to greater than 2× relative speedup due to cache effects on graph data and hash tables
[Figure: bar chart of Per-iter Speedup (0–5) for LiveJournal, Friendster, Twitter, Host, uk-2007, wBTER_50, wBTER_100]
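Overlapping communication amounts to double buffering: while the compute threads work on the chunk currently resident in HBM, transfer threads stage the next chunk into a second buffer. A sequential sketch of the ping-pong buffer discipline (the memcpy here stands in for the concurrent transfer threads, and the summation stands in for the label updates):

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Process an array resident in "DDR" in chunks of chunk_sz elements,
 * staging each chunk into one of two ping-pong buffers (which would be
 * hbw_malloc'd in flat mode) before working on it. Returns the sum of
 * all elements as a stand-in for real per-chunk work. */
long process_chunks(const int *ddr, int n, int chunk_sz) {
    int *buf[2];
    buf[0] = malloc(chunk_sz * sizeof(int));
    buf[1] = malloc(chunk_sz * sizeof(int));
    int n_chunks = (n + chunk_sz - 1) / chunk_sz;
    int len0 = n < chunk_sz ? n : chunk_sz;
    memcpy(buf[0], ddr, len0 * sizeof(int));     /* stage first chunk */
    long total = 0;
    int cur = 0;
    for (int c = 0; c < n_chunks; c++) {
        int lo = c * chunk_sz;
        int len = (lo + chunk_sz <= n) ? chunk_sz : n - lo;
        /* transfer threads would stage chunk c+1 concurrently here */
        if (c + 1 < n_chunks) {
            int nlo = (c + 1) * chunk_sz;
            int nlen = (nlo + chunk_sz <= n) ? chunk_sz : n - nlo;
            memcpy(buf[1 - cur], ddr + nlo, nlen * sizeof(int));
        }
        for (int i = 0; i < len; i++) total += buf[cur][i];  /* "work" */
        cur = 1 - cur;                            /* swap ping-pong buffers */
    }
    free(buf[0]);
    free(buf[1]);
    return total;
}
```

With real transfer threads, the memcpy and the work loop for a chunk run concurrently, which is why balancing work and transfer thread counts matters.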
16 / 20
Overall: Cache vs. Hybrid vs. Flat modes
Best times (in seconds) in each mode for each graph for 40 iterations or convergence