NUMA-aware Scalable Graph Traversal on SGI UV Systems
NUMA-aware scalable graph traversal on SGI UV systems
Yuichiro Yasui, Katsuki Fujisawa (Kyushu University)
Eng Lim Goh, John Baron, Atsushi Sugiura (SGI Corp.)
Takashi Uchiyama (SGI Japan, Ltd.)
HPGP'16 @ Kyoto, May 31, 2016
Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
Our motivations
• NUMA / cc-NUMA architecture
• Graph algorithm: BFS
• Efficient NUMA-aware BFS algorithm
– improves the locality of reference in memory accesses
– exploits multithreading on many-socket systems (SGI UV 2000, UV 300)
[Figure: Graph500 benchmark flow. Input parameters (SCALE, edgefactor) → graph generation → graph construction → 64 BFS iterations with validation → results (BFS time, traversed edges, TEPS).]
[Figure: a many-socket NUMA system; each socket (CPU + RAM) is assigned a partial CSR graph, local accesses are fast and remote accesses are slow. Many real-world relationships are represented by a graph structure.]
Kronecker graph with SCALE 34: 17 billion vertices and 275 billion edges
SGI UV 300: 32 sockets of 18-core Xeon and 16 TB RAM
Our contributions (previous work and this paper)
• Efficient graph data structure
[Diagram: input graph → vertex sorting (HPCS15) and adjacency-list sorting by outdegree (ISC14) → NUMA-aware CSR graph (BD13) partitioned into A0–A3, each part bound to a NUMA node]
• Efficient BFS based on Beamer's direction-optimizing BFS (SC12)
– Top-down direction: Agarwal's top-down (SC10) with socket-queues for remote edges; this paper adds pruning of remote edges (reduction of remote edge traversals)
– Bottom-up direction: NUMA-aware bottom-up (BD13); each NUMA node reads the frontier CQ as input and writes its local VSk and NQk
• New result: 219 GTEPS on UV 300 w/ 32 sockets, the updated highest single-node score
Graph processing for large-scale networks
• Large-scale networks arise in a wide range of application areas
– Transportation: US road network, 24 million vertices and 58 million edges
– Social networks: Twitter follow-ship (Twitter 2009), 61.6 million vertices and 1.47 billion edges
– Cyber-security: 15 billion log entries per day
– Bioinformatics: neuronal network @ Human Brain Project, 89 billion vertices and 100 trillion edges
• Fast and scalable graph processing with HPC
– categorized as a data-intensive application
Graph analysis and its important kernel BFS
• Used to understand relationships in real-world networks
[Figure: three-step pipeline. Step 1: constructing a graph from relationship data; Step 2: graph processing; Step 3: understanding the results]
• Algorithms used across application fields: breadth-first search, single-source shortest path, maximum flow, maximal independent set, centrality metrics, clustering, graph mining
Breadth-first search (BFS)
• One of the most important and fundamental algorithms for traversing graph structures
• Many algorithms and applications build on BFS (e.g., maximum flow and centrality)
• The well-known algorithm takes O(n + m) time for a digraph G with n vertices and m edges
• Inputs: digraph G = (V, E) and a source vertex
• Outputs: BFS tree and the distance (level) of each vertex from the source
[Figure: BFS from a source vertex, expanding level by level (Lv. 1, Lv. 2, Lv. 3)]
BFS on the Twitter follow-ship network (Twitter 2009)
• Follow-ship network
– #Users (#vertices): 41,652,230
– #Follow-ships (#edges): 2,405,026,092
• BFS result from user 21,804,357, excluding unconnected users:

Lv.        #users   ratio (%)   percentile (%)
0               1        0.00        0.00
1               7        0.00        0.00
2           6,188        0.01        0.01
3         510,515        1.23        1.24
4      29,526,508       70.89       72.13
5      11,314,238       27.16       99.29
6         282,456        0.68       99.97
7          11,536        0.03      100.00
8             673        0.00      100.00
9              68        0.00      100.00
10             19        0.00      100.00
11             10        0.00      100.00
12              5        0.00      100.00
13              2        0.00      100.00
14              2        0.00      100.00
15              2        0.00      100.00
Total  41,652,230      100.00         -

• Six degrees of separation: "everyone and everything is six or fewer steps away"; nearly all users are reached within six BFS levels
• Ours: 60 milliseconds per BFS
Graph500 and Green Graph500
• New benchmarks built on graph processing (breadth-first search)
• They measure the performance and energy efficiency of irregular memory accesses
– Graph500 benchmark: TEPS score (number of traversed edges per second), measuring the performance of irregular memory accesses
– Green Graph500 benchmark: TEPS-per-Watt score, measuring power-efficient performance (power consumption measured in watts)
• Benchmark flow, with input parameters SCALE and edgefactor (= 16):
1. Generation: a Kronecker graph with 2^SCALE vertices and 2^SCALE × edgefactor edges, built by applying the recursive Kronecker product SCALE times (G1, G2, G3, G4, ...)
2. Construction
3. BFS × 64 iterations, each followed by validation
• Results: BFS time, traversed edges, and TEPS for each iteration; the Graph500 score is the median of the 64 TEPS values, and the Green Graph500 score is TEPS per Watt
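To make the scoring rule concrete, here is a minimal sketch of how the score could be computed from 64 measured runs; median_teps() and its arguments are our illustration, not part of the reference code.

#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
  double x = *(const double *)a, y = *(const double *)b;
  return (x > y) - (x < y);
}

/* Graph500 score: the median TEPS over the 64 required BFS iterations. */
double median_teps(const double bfs_time[64], const double traversed_edges[64]) {
  double teps[64];
  for (int i = 0; i < 64; i++)
    teps[i] = traversed_edges[i] / bfs_time[i];   /* TEPS of one iteration */
  qsort(teps, 64, sizeof(double), cmp_double);
  return (teps[31] + teps[32]) / 2.0;             /* median of 64 samples */
}

The Green Graph500 score would then divide this figure by the measured power consumption in watts.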
NUMA (non-uniform memory access) system
• Threads access data in local RAM quickly and data in non-local (remote) RAM slowly; the distances differ between NUMA node pairs
• Example: a 4-socket Xeon NUMA system
– 4 CPU sockets
– 8 physical cores per socket
– 2 threads per core
• Measured bandwidth (GB/s) between source and target NUMA nodes (diagonal elements = local access):

target \ source     0      1      2      3
0                24.2    3.4    3.0    3.4
1                 3.3   23.9    3.5    3.0
2                 3.0    3.4   24.3    3.4
3                 3.5    3.0    3.4   24.2

• Local access: 24 GB/s; remote access: 3 GB/s
SGI UV 2000
• UV 2000
– Single OS: SUSE Linux 11 (x86_64)
– hypercube interconnect (NUMAlink6, 6.7 GB/s × 4)
– up to 2,560 cores and 64 TB RAM (= 128 UV 2000 chassis × 2 sockets × 10 cores)
– ISM has two full-spec UV 2000 systems
• Hierarchical network topology
– sockets, chassis, cubes, inner racks, and outer racks
– UV 2000 chassis = 2 sockets; cube = 8 chassis; rack = 32 nodes
[Diagram: a chassis (CPU + RAM pairs) connected to other chassis via NUMAlink6; sites: ISM, Kyushu U. Note: remote RAM in other chassis cannot be detected as a NUMA node]
SGI UV 300
• UV 300
– Single OS: SUSE Linux 11 (x86_64)
– all-to-all interconnect
– up to 1,152 cores and 16 TB RAM (= 8 UV 300 chassis × 4 sockets × 18 cores × 2 SMT)
• UV 300 chassis
– 4-socket 18-core Intel Xeon E7-8867 (Haswell), Hyperthreading enabled (2 SMT)
– 2 TB RAM (512 GB per NUMA node)
• UV 300 rack (Kyushu U.): 8 chassis, connected all-to-all
Memory bandwidths on UV 2000 and UV 300
• Bandwidths in GB/s between NUMA nodes, measured with STREAM TRIAD
• Local access is clearly faster than remote access
[Figure: thread placement × memory placement bandwidth heatmaps]
– UV 2000 (64 sockets): local 33 GB/s, remote 3–7 GB/s; each chassis has 2 sockets, and chassis connect to each other in a hypercube topology
– UV 300 (32 sockets): local 56 GB/s, intra-chassis remote 12–14 GB/s, inter-chassis remote 6 GB/s; each chassis has 4 sockets, and chassis connect to each other in an all-to-all topology
Programming cost for NUMA awareness
• Thread and memory binding
– reduces remote accesses
– avoids thread migration
• Linux provides low-level interfaces
– sched_{set,get}affinity(): binds a thread to a processor set (specified by processor id)
– mbind(): binds pages to a NUMA node set (specified by NUMA node id)
– Linux exposes processor ids and NUMA node ids as system files: /proc/cpuinfo, /sys/devices/system/{node,cpu}/
• ULIBC reduces this programming cost
– provides APIs for NUMA-aware programming
– available at https://bitbucket.org/yuichiro_yasui/ulibc
Binding a thread to a processor set, specified by processor id:

#define _GNU_SOURCE
#include <sched.h>

int bind_thread(int procid) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(procid, &set);
  return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
}
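For memory placement, the corresponding low-level call is mbind(). The following is a minimal sketch assuming a single target node; bind_pages() is our illustrative helper, not a ULIBC API.

#include <numaif.h>   /* mbind(), MPOL_BIND; link with -lnuma */
#include <stddef.h>

/* Bind the pages backing [addr, addr+len) to one NUMA node. */
long bind_pages(void *addr, size_t len, int node) {
  unsigned long nodemask = 1UL << node;      /* one bit per NUMA node id */
  return mbind(addr, len, MPOL_BIND, &nodemask,
               8 * sizeof(nodemask),         /* number of bits in the mask */
               0);
}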
CPU affinity construction using ULIBC
1. Detects the entire topology, e.g., a 4-socket system with Socket 0–3 and RAM 0–3
2. Detects the online topology, i.e., the processors and NUMA nodes actually available to the process
e.g.) numactl --cpunodebind=1,2 --membind=1,2
3. Constructs two types of affinities
• Compact-type affinity: assigns threads in positions close to each other
e.g.) export ULIBC_AFFINITY=compact:fine
      export OMP_NUM_THREADS=7
• Scatter-type affinity: distributes the threads as evenly as possible across online processors
e.g.) export ULIBC_AFFINITY=scatter:fine
      export OMP_NUM_THREADS=7
[Diagram: two NUMA nodes with local RAM; compact packs threads 0–3 onto NUMA node 0 first, while scatter alternates threads 0, 2 onto node 0 and 1, 3 onto node 1]
NUMA-aware computation with ULIBC
• ULIBC is a callable library for NUMA-aware computation
• Detects the processor topology at run time
• Constructs thread and memory affinity settings
• ULIBC is available at https://bitbucket.org/yuichiro_yasui/ulibc

#include <stdio.h>
#include <omp.h>
#include <ulibc.h>   /* include header file */

int main(void) {
  ULIBC_init();                              /* initialize */
  _Pragma("omp parallel")
  {
    const int tid = ULIBC_get_thread_num();  /* get thread id */
    ULIBC_bind_thread();                     /* bind current thread */
    const struct numainfo_t loc = ULIBC_get_numainfo(tid);  /* get NUMA placement */
    printf("Thread: %2d, NUMA-node: %d, NUMA-core: %d\n",
           loc.id, loc.node, loc.core);
    /* do something */
  }
  return 0;
}

Execution log on a 4-socket system (thread id, NUMA node id, core id):
Thread:  4, NUMA-node: 0, NUMA-core: 1
Thread: 55, NUMA-node: 3, NUMA-core: 13
Thread: 16, NUMA-node: 0, NUMA-core: 4
Thread: 37, NUMA-node: 1, NUMA-core: 9
Thread: 30, NUMA-node: 2, NUMA-core: 7
. . .
Level-synchronized parallel BFS (top-down)
• Starts from the source vertex and executes the following two phases at each level
– Traversal phase: finds unvisited vertices adjacent to the frontier CQ (current queue) and appends them into NQ (next queue)
– Swap phase: exchanges CQ and NQ for the next level, with a synchronization between levels
[Figure: levels 0–3 expanding from the source; at level k, CQ holds the frontier and NQ collects the candidate neighbors that become the level-(k+1) frontier]
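A minimal serial sketch of one traversal phase on a CSR graph follows; the csr_t layout (xadj/adjncy) and the function names are our assumptions rather than the paper's code, and a parallel version would need atomic updates of the visited array.

/* One level-synchronized top-down step on a CSR graph. */
typedef struct { int n; const int *xadj, *adjncy; } csr_t;

/* Scan the frontier CQ and append newly visited vertices to NQ; returns |NQ|. */
int topdown_step(const csr_t *g, const int *cq, int cq_len,
                 int *nq, char *visited, int *pi /* BFS tree */) {
  int nq_len = 0;
  for (int i = 0; i < cq_len; i++) {
    int v = cq[i];
    for (int e = g->xadj[v]; e < g->xadj[v + 1]; e++) {
      int w = g->adjncy[e];
      if (!visited[w]) {          /* unvisited neighbor found */
        visited[w] = 1;
        pi[w] = v;                /* parent in the BFS tree */
        nq[nq_len++] = w;
      }
    }
  }
  return nq_len;                  /* caller swaps CQ and NQ, then synchronizes */
}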
Direction-optimizing BFS [Beamer, SC12]
• Two directions: top-down or bottom-up
– the top-down direction traverses outgoing edges of the frontier
– the bottom-up direction traverses incoming edges of unvisited vertices
• Forward search (top-down) versus backward search (bottom-up) for BFS; traversed edges per level:

Level    Top-down        Bottom-up        Hybrid
0                   2    2,103,840,895             2
1              66,206    1,766,587,029        66,206
2         346,918,235       52,677,691    52,677,691
3       1,727,195,615       12,820,854    12,820,854
4          29,557,400          103,184       103,184
5              82,357           21,467        21,467
6                 221           21,240           227
Total   2,103,820,036    3,936,072,360    65,689,631
Ratio         100.00%          187.09%         3.12%

• The hybrid runs top-down while the frontier (by distance from the source) is small and switches to bottom-up at the middle levels, where the frontier is large
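The switch between directions can be sketched as below; the alpha and beta thresholds follow the spirit of Beamer's SC12 heuristic, but the exact values and all names here are our assumptions.

enum dir { TOP_DOWN, BOTTOM_UP };

/* m_f: edges incident to the frontier; m_u: edges incident to unvisited
   vertices; n_f: frontier size; n: total number of vertices. */
enum dir choose_direction(enum dir cur, long m_f, long m_u, long n_f, long n) {
  const int alpha = 14, beta = 24;       /* tuning parameters (assumed) */
  if (cur == TOP_DOWN && m_f > m_u / alpha)
    return BOTTOM_UP;                    /* frontier grew large: go bottom-up */
  if (cur == BOTTOM_UP && n_f < n / beta)
    return TOP_DOWN;                     /* frontier shrank: back to top-down */
  return cur;
}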
Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
Our contributions (previous work and this paper)
• Efficient graph data structure
[Diagram: input graph → vertex sorting (HPCS15) and adjacency-list sorting by outdegree (ISC14) → NUMA-aware CSR graph (BD13) partitioned into A0–A3, each part bound to a NUMA node]
• Efficient BFS based on Beamer's direction-optimizing BFS (SC12)
– Top-down direction: Agarwal's top-down (SC10) with socket-queues; this paper adds the reduction of remote edges by pruning
– Bottom-up direction: NUMA-aware bottom-up (BD13); each NUMA node reads the frontier CQ as input and writes its local VSk and NQk
• New results
– UV 2000 w/ 64 sockets: 131 GTEPS → 152 GTEPS (NEW), about 16% faster
– UV 300 w/ 32 sockets: 219 GTEPS (NEW), faster than the highest single-node entry
Ours: NUMA-aware 1-D partitioned graph [BD13]
• Divides the graph into subgraphs and assigns one to each NUMA node (CPU + RAM)
– the adjacency matrix is 1-D partitioned into A0, A1, A2, A3
– each subgraph is represented as a CSR graph
• Bottom-up direction (the bottleneck component): each NUMA node computes its partial NQ using a locally copied frontier CQ and the locally assigned visited set VSk, so all accesses are local
• Top-down direction: a modified version of Agarwal's NUMA-aware BFS; it uses the inverse of G (G is undirected), reading the local frontier CQk and visited VSk but writing neighbors into NQ both locally and remotely
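A sketch of the vertex ownership test behind this 1-D partition, assuming contiguous equal-size blocks per NUMA node; the types and names are ours, not the [BD13] code.

typedef struct { int num_nodes, n; } part_t;

/* Owner NUMA node of vertex v under a contiguous block partition. */
static inline int owner(const part_t *p, int v) {
  int block = (p->n + p->num_nodes - 1) / p->num_nodes;  /* ceil(n / #nodes) */
  return v / block;
}

/* First vertex owned by NUMA node k (its CSR rows start here). */
static inline int first_vertex(const part_t *p, int k) {
  int block = (p->n + p->num_nodes - 1) / p->num_nodes;
  return k * block;
}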
Ours: Adjacency list sorting [ISC14]
• Reduces unnecessary edge traversals in the bottom-up direction
• The bottom-up search scans the adjacency list A(v) of an unvisited vertex v and breaks out of the loop as soon as it finds a frontier vertex; the remaining adjacent vertices are skipped
• Sorting each adjacency list by the outdegrees of its entries (high to low) places likely frontier vertices first, which shortens the loop count τ
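A minimal sketch of sorting one adjacency list by descending outdegree; the file-scope comparator state and all names are our illustration of the [ISC14] idea.

#include <stdlib.h>

static const int *g_outdeg;   /* outdegree table used by the comparator */

static int by_outdeg_desc(const void *a, const void *b) {
  int da = g_outdeg[*(const int *)a], db = g_outdeg[*(const int *)b];
  return (da < db) - (da > db);       /* higher outdegree first */
}

/* Sort the adjacency list adj[0..len) of one vertex. */
void sort_adjacency(int *adj, int len, const int *outdeg) {
  g_outdeg = outdeg;
  qsort(adj, len, sizeof(int), by_outdeg_desc);
}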
Ours: Vertex sorting [HPCS15]
• The number of traversals of a vertex equals the outdegree of the corresponding vertex, so access frequency and outdegree are correlated
• Our vertex sorting reorders the vertex indices by outdegree: the highest-outdegree vertex receives the smallest index
• Small-index vertices then receive many accesses, which concentrates the hot data and improves locality
[Figure: degree distribution and access frequency with vertex sorting; original indices versus indices sorted by outdegree]
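A sketch of the corresponding vertex renumbering, again by descending outdegree; perm/rank and all other names are our illustration, not the [HPCS15] code.

#include <stdlib.h>

static const int *perm_outdeg;  /* outdegree table for the comparator */

static int perm_cmp(const void *a, const void *b) {
  int da = perm_outdeg[*(const int *)a], db = perm_outdeg[*(const int *)b];
  return (da < db) - (da > db);           /* higher outdegree first */
}

/* perm[new_id] = old_id; rank[old_id] = new_id relabels the CSR graph. */
void build_vertex_order(int n, const int *outdeg, int *perm, int *rank) {
  for (int v = 0; v < n; v++) perm[v] = v;  /* identity permutation */
  perm_outdeg = outdeg;
  qsort(perm, n, sizeof(int), perm_cmp);
  for (int v = 0; v < n; v++) rank[perm[v]] = v;
}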
NUMA-aware top-down BFS
• The original version was proposed by Agarwal [Agarwal-SC10]
• Reduces random remote accesses using socket-queues; on ℓ sockets, roughly Local : Remote = 1 : (ℓ−1)
• Example, focused on NUMA 2 of four NUMA nodes:
– Phase 1 (CQ → NQ or socket-queue): unvisited local vertices are appended into NQ; remote edges are appended into the owner's socket-queue; then synchronize
– Phase 2 (socket-queue → NQ): unvisited vertices arriving via the socket-queue are appended into NQ; then synchronize
– Next level: swap CQ and NQ
NUMA-aware top-down with pruning of remote edges
• This paper prunes remote edge traversals to reduce remote accesses (example: remote edge traversal on NUMA 2)
• w/o pruning (original, proposed in Agarwal's SC10 paper): each NUMA node appends all remote edges (v, w) into the corresponding socket-queue
• with pruning (this paper): each NUMA node appends a remote edge (v, w) into the corresponding socket-queue only if the bitmap F does not contain w; F then records w
– F reuses the CQ bitmap from the bottom-up direction
– F is not initialized while there is no change of search direction, so each vertex is forwarded through the socket-queues and searched once only
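The pruning test is essentially an atomic test-and-set on the F bitmap. A minimal sketch, assuming a 64-bit word layout and GCC atomic builtins; the names are ours.

#include <stdint.h>

/* Returns 1 if w was newly added to F (so the remote edge (v, w) should be
   forwarded to SQ_owner(w)); 0 if w was already in F and can be pruned. */
static inline int test_and_set_F(uint64_t *F, int w) {
  uint64_t mask = 1ULL << (w & 63);
  uint64_t old = __sync_fetch_and_or(&F[w >> 6], mask);  /* atomic fetch-or */
  return (old & mask) == 0;
}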
Effects of pruning & Updated TEPS score
• Pruned many remote edges
[Figure 3: Memory bandwidth (GB/s) between arbitrary two NUMA nodes, thread placement versus memory placement: (a) UV 2000 with 64 CPU sockets (one rack), (b) UV 300 with 32 CPU sockets (one rack).]
[Figure 4: Memory bandwidth (GB/s) versus number of NUMA nodes (CPU sockets): UV 300 (Haswell) (HT, THP, Local), UV 300 (Haswell) (HT, THP, Remote), UV 2000 (Ivy Bridge), and SB4 (Sandy Bridge-EP) (HT, THP).]
5. NUMERICAL RESULTS

5.1 SGI UV 2000
[Figure 5: Ratio of traversed edges (local, pruned-remote, remote) on a NUMA node in the top-down algorithm with remote edge traversal pruning, for a Kronecker graph with SCALE 29 on a four-socket server (SB4). Total traversed edges at levels 0-7: 4, 9.04K, 221M, 15.3B, 1.50B, 4.55M, 11.1K, 29.]
Algorithm 3: Top-down with pruning of remote traversals

Procedure NUMA-aware-Top-down(G, CQ, VS, π)
fork   /* the i-th thread runs on the j-th core of the k-th CPU */
 1  (i, j, k) ← ULIBC_get_current_numainfo()
 2  NQk ← ∅
 3  for v ∈ CQk in parallel do
 4    for w ∈ ABk(v) do
 5      if owner(w) = k then
 6        if w ∉ VS (atomic) then
 7          π(w) ← v
 8          VSk ← VSk ∪ {w}
 9          NQk ← NQk ∪ {w}
10      else
11        if w ∉ Fk (atomic) then
12          Fk ← Fk ∪ {w}
13          SQowner(w) ← SQowner(w) ∪ {(v, w)}
14  synchronize
15  for (v, w) ∈ SQk in parallel do
16    if w ∉ VS (atomic) then
17      π(w) ← v
18      VSk ← VSk ∪ {w}
19      NQk ← NQk ∪ {w}
20  join
21  return NQk
Fig. 6 shows the weak-scaling performance of our previous [22] and current implementations, which collect TEPS scores with a fixed problem size of SCALE 26 and SCALE 27 per CPU socket, respectively. The previous implementation scaled up to 1,280 threads, achieving 131 GTEPS for SCALE 32 with 640 threads and 175 GTEPS for SCALE 33 with 1,280 threads. In contrast, the current implementation achieves 152 GTEPS for SCALE 33 with 640 threads; scalability is improved as a result of pruning remote edge traversals in the top-down direction. However, we only have results for a maximum of 640 threads. The performance gap between the previous (131 GTEPS) and current (152 GTEPS) implementations is 15.8% (= 152/131) on one rack of UV 2000 with 640 threads.
[Figure 6: Weak scaling on UV 2000, GTEPS versus number of NUMA nodes (CPU sockets) from 1 to 128. HPCS15-SG (SCALE 26 per NUMA node): 7.7, 15.3, 24.2, 42.1, 59.4, 94.8, 131.4, 174.7. This paper (SCALE 27 per NUMA node): 8.3, 14.2, 25.1, 38.6, 61.5, 91.8, 152.2.]
5.2 SGI UV 300
In this study, we obtained new results on SGI UV 300, which has 32 CPU sockets and 16 TB of memory. Fig. 7 depicts TEPS versus the number of CPU sockets (NUMA nodes), and Table 8 shows the TEPS obtained with 32 CPU sockets. We discuss the results with the following parameters:
• Hyperthreading (HT): {enabled, disabled}
• Transparent hugepage (THP): {enabled, disabled}
• Priority mode for memory access: {local, remote}
For example, "(HT, THP, local)" means Hyperthreading and Transparent hugepage are enabled and the priority mode is set to local memory. In the table, a check mark (X) indicates that a parameter is enabled. First, UV 300 is faster than UV 2000 for large problem sizes. Second, Hyperthreading (HT) and Transparent hugepage (THP) improved the performance by 16.49% (= 219/188) and 4.78% (= 219/209), respectively. Third, our implementation applied several techniques that improve the locality of memory access, making it well suited to the local-memory priority mode. Ultimately, the best performance obtained was 219 GTEPS for SCALE 34 with the configuration (HT, THP, Local).
[Figure 7: Weak scaling on UV 300, GTEPS versus number of NUMA nodes (CPU sockets) from 1 to 32. UV 300 (HT, THP, Local, SCALE 29 per socket): 18.7, 32.5, 64.7, 100.3, 161.5, 219.4. UV 2000 (SCALE 27 per socket): 8.3, 14.2, 25.1, 38.6, 61.5, 91.8, 152.2. Additional series: UV 300 (HT, THP, Remote) and UV 300 (HT, Remote) with SCALE 29 and SCALE 27 per socket.]
Table 8: GTEPS on UV 300 with 32 CPU sockets

System   SCALE  HT   THP  Mode    GTEPS
UV 2000  32     -    -    -        92
UV 300   32     X    -    Remote  171
UV 300   34     X    -    Remote  204
UV 300   34     X    X    Remote  209
UV 300   34     *1   X    Local   188
UV 300   34     X    X    Local   219

*1: uses the same number of threads as physical cores.
5.3 STREAM and Graph500 benchmarks
Finally, the correlation between memory bandwidth (bytes per second) in the STREAM benchmark and graph traversal performance (TEPS) is depicted in Fig. 8. Each line plots pairs of {memory bandwidth (GB/s) of the STREAM TRIAD operation with 10^7 elements per CPU socket, Graph500 score (GTEPS) with SCALE 27 per CPU socket} for 1, 2, 4, 8, 16, and 32 CPU sockets. We obtained the memory bandwidth score via a modified implementation using ULIBC, in which each thread computes the partial TRIAD operation only on vectors in local memory, as described in subsection 3.2. The figure shows a correlation between memory bandwidth and graph traversal performance. The optimized Graph500 implementation and our previous implementation scale like the memory bandwidth. In contrast, the reference code of Graph500 does not scale and cannot exploit the NUMA system efficiently.
[Figure 8: TEPS versus memory bandwidth (GB/s). (a) GTEPS: UV 300 (Haswell) (HT, THP, Local) and UV 2000 (Ivy Bridge) for this paper and BD13; SB4 (Sandy Bridge-EP) (HT, THP) for this paper and BD13. (b) MTEPS: SB4 (Sandy Bridge-EP) (HT, THP), Graph500 reference code.]
6. CONCLUSIONS
In this paper, we presented a new and efficient breadth-first search algorithm for large-scale networks on a single node.

Notes on pruning (measured with SCALE 29 on a 4-socket Xeon; the UV 2000 results use only 64 sockets):
• However, this method may not be effective on a small number of sockets, because the algorithm switches to the bottom-up direction at the middle levels
• Direction sequence: previous, TD → BU; this paper, TD → BU → TD
• On many sockets, the TEPS score is updated; UV 2000 with 64 sockets:
– w/o pruning: 131 GTEPS
– w/ pruning: 152 GTEPS
Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
Weak scaling performance
• UV 300 clearly outperforms UV 2000
[Figure: weak scaling, GTEPS versus number of sockets (1 to 64).
UV 2000, SCALE 27 per socket: 8.3, 14.2, 25.1, 38.6, 61.5, 91.8, 152.2.
UV 300 (HT, Remote mode), SCALE 27 per socket: 16.3, 29.2, 53.5, 83.9, 129.4, 171.0.
UV 300 (HT, Remote mode), SCALE 29 per socket: up to 203.7.
UV 300 (HT and THP, Remote mode), SCALE 29 per socket: up to 209.3.
UV 300 (HT and THP, Local mode), SCALE 29 per socket: 18.7, 32.5, 64.7, 100.3, 161.5, 219.4.]
• The UV 2000 and UV 300 configurations are compared on the next slide
Breakdown of system configuration on UV 300
• UV 300 is 2x faster than UV 2000
– with the same number of sockets (32)
– #threads per socket = #logical cores
• The best performance of UV 300 is obtained with
– a larger problem size (+19.3% by using a larger memory space)
– THP (transparent hugepage) enabled (+2.5%)
– the memory reference mode set to local (+4.8%)
– HT (Hyperthreading) enabled (+16.5%)

System   #sockets  SCALE  HT   THP  Mem-ref mode  GTEPS
UV 2000  32        32     -    -    -              92
UV 300   32        32     X    -    Remote        171
UV 300   32        34     X    -    Remote        204
UV 300   32        34     X    X    Remote        209
UV 300   32        34     *1   X    Local         188
UV 300   32        34     X    X    Local         219

*1: uses the same number of threads as physical cores, emulating "Hyperthreading disabled"
A 28.3% performance gap separates the baseline UV 300 configuration (171 GTEPS) from the best one (219 GTEPS).
[Figure: the same weak-scaling chart as the previous slide, annotated with these configuration gains]
New results and the Nov. 2015 Graph500 list
• Ours: updated fastest single-node entry
– SCALE 34, 219 GTEPS on SGI UV 300 (1 node / 576 cores)
– HT enabled, THP enabled, local-reference mode
• Previous fastest single-node entries (ours):
– SGI UV 2000 (1,280 cores): SCALE 33, 174.7 GTEPS
– SGI UV 2000 (640 cores): SCALE 33, 149.8 GTEPS
Bandwidth and TEPS
• Bandwidth and TEPS of our implementations on three systems
– GB/s: STREAM TRIAD with 10 M elements per socket
– TEPS: SCALE 27 (n = 134M, m = 2.15B) per socket
[Figure: GTEPS versus memory bandwidth (GB/s); see Figure 8 above.]
• Bandwidth and GTEPS are correlated on three Xeon systems, for both our previous implementation [BD13] and this paper
– UV 300: 32-socket Haswell
– UV 2000: 64-socket Ivy Bridge
– SB4: 4-socket Sandy Bridge-EP
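The per-socket bandwidth came from a ULIBC-modified STREAM in which each thread runs TRIAD only on vectors in local memory. A minimal sketch of that kernel, with our own names and with the allocation and NUMA-binding steps elided:

#include <stddef.h>

/* STREAM TRIAD over one thread's local slice [lo, hi):
   a[i] = b[i] + scalar * c[i]. The arrays are assumed to be
   allocated and bound to the calling thread's NUMA node. */
void triad_local(double *a, const double *b, const double *c,
                 double scalar, size_t lo, size_t hi) {
  for (size_t i = lo; i < hi; i++)
    a[i] = b[i] + scalar * c[i];
}

Bandwidth then follows STREAM's accounting of 24 bytes (two reads, one write) per element, divided by the elapsed time and aggregated over the sockets in use.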
Conclusion
Motivations
• NUMA / cc-NUMA architecture; graph algorithm: BFS
• Efficient NUMA-aware BFS algorithm
– NUMA awareness improves the locality of memory accesses
– exploits multithreading on many-socket systems (SGI UV 2000, UV 300)
Contributions
• NUMA-aware scalable BFS algorithm, with pruning of edge traversals to reduce remote edges
– scales to more than a thousand threads on SGI UV 2000 and SGI UV 300
– updated the highest single-node score to 219 GTEPS on SGI UV 300 with 32 sockets
• "ULIBC": a callable library for NUMA-aware computation
– available at https://bitbucket.org/yuichiro_yasui/ulibc
References
• [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System, IEEE BigData 2013.
• [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient Breadth-first Search on a Single NUMA System, ISC'14, 2014.
• [HPCS15] Y. Yasui and K. Fujisawa: Fast and Scalable NUMA-based Thread Parallel Breadth-first Search, HPCS 2015, ACM, IEEE, IFIP, 2015.
• [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno, Y. Yasui, K. Iwabuchi, and T. Endo: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers, Proceedings of Optimization in the Real World -- Toward Solving Real-World Optimization Problems --, Springer, 2015.
NUMA-aware BFS algorithm
Other results of our Graph500 team