Scalable Algorithms for Massive-scale Graphs
KAUST, SC11, November 2011
Fabrizio Petrini, IBM TJ Watson
© 2011 IBM Corporation
LinkedIn Network
Data in the real world: Large Complex Networks
Domains: Social Networks, Internet, Biological, Finance, Transportation
- Internet: nodes are the routers, edges are the connections. Finding the network topology; solving bandwidth, flow, and shortest-path problems. (1.8 billion Internet users and growing)
- Network security: large graphs that can be used to dynamically track network behavior and identify potential attacks
Data source: DIMACS, Wikipedia
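The shortest-path and flow problems above all start from a graph traversal; below is a minimal sketch of the level-synchronous BFS kernel that Graph500 (discussed later in this deck) benchmarks. The CSR layout and names (off, adj, parent) are illustrative, not taken from the talk.

#include <stdint.h>
#include <stdlib.h>

/* Level-synchronous BFS over a CSR graph: off[u]..off[u+1] indexes
 * u's neighbors in adj.  parent[v] doubles as the visited flag. */
void bfs(int64_t nv, const int64_t *off, const int64_t *adj,
         int64_t root, int64_t *parent)
{
    int64_t *frontier = malloc(nv * sizeof *frontier);
    int64_t *next     = malloc(nv * sizeof *next);
    for (int64_t v = 0; v < nv; v++) parent[v] = -1;
    parent[root] = root;
    frontier[0]  = root;
    for (int64_t nf = 1; nf > 0; ) {
        int64_t nn = 0;
        for (int64_t i = 0; i < nf; i++) {            /* expand frontier */
            int64_t u = frontier[i];
            for (int64_t e = off[u]; e < off[u + 1]; e++) {
                int64_t v = adj[e];
                if (parent[v] == -1) {                /* first visit */
                    parent[v] = u;
                    next[nn++] = v;
                }
            }
        }
        int64_t *t = frontier; frontier = next; next = t;  /* swap levels */
        nf = nn;
    }
    free(frontier);
    free(next);
}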
Data in the real world: Large Complex Networks
Domains: Social Networks, Internet, Biological, Finance, Transportation
- Social networks: nodes represent people, joined through personal or professional acquaintance, or through groups and communities.
- Facebook: grew from 100 to 150 million users in 4 months; >400 million currently
- Twitter: 75 million users (Jan 2010); 6.2 million new users per month
Data source: Nielsen Report, The Inquirer
Blue Gene/Q
1. Chip: 16+2 PowerPC cores
2. Single-chip module
3. Compute card: one chip module, 8/16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards w/ 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
- Sustained single-node performance: 10x Blue Gene/P, 20x Blue Gene/L
- MF/Watt: 6x Blue Gene/P, 10x Blue Gene/L (~2 GF/W, Green 500 criteria)
- Software and hardware support for programming models that exploit node hardware concurrency
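A quick sanity check of the 20 PF/s system figure from the per-node peak given later (204.8 GF/s): 32 compute cards per node card x 16 node cards per midplane x 2 midplanes per rack = 1024 nodes per rack; 1024 x 204.8 GF/s ≈ 209.7 TF/s per rack; 96 racks x 209.7 TF/s ≈ 20.1 PF/s.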
Messaging performance (pclk = processor clocks; lower is better):
Message latency:                       SPI 530 pclk | PAMI 1150 pclk | MPI 1184 pclk
Message rate, single threaded
(pclk per message):                    SPI 134 pclk | PAMI 245 pclk  | MPI 779 pclk
Total MPI latency: 2864 pclk; total MPI overhead: 1172 pclk.
Our Graph500 implementation relies on SPI
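For context, message-latency numbers like these usually come from a ping-pong microbenchmark. Below is a generic MPI sketch; the SPI and PAMI measurements use IBM-specific interfaces not reproduced here, and the iteration count is an arbitrary choice.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char byte = 0;
    const int iters = 100000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {            /* ping */
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* pong */
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)                  /* one-way latency = half the round trip */
        printf("latency: %.3f us\n", dt / iters / 2 * 1e6);
    MPI_Finalize();
    return 0;
}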
Blue Gene/Q compute chip
- 16+1 core SMP; each core 4-way hardware threaded
- Transactional memory and thread-level speculation
- Quad floating-point unit on each core
- 204.8 GF peak per node at 1.6 GHz
- 32 MB shared L2 cache; 563 GB/s bisection bandwidth to the shared L2 (Blue Gene/L at LLNL has 700 GB/s for the entire system)
- 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3; 2 channels, each with chip-kill protection)
- 16 GB memory per node
- Network: 10 intra-rack and inter-rack interprocessor links (5D torus), each at 2.0 GB/s; one 2.0 GB/s I/O link to the I/O subsystem
- 55 W chip power
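The 204.8 GF peak follows from the quad FPU: assuming fused multiply-add, i.e., 8 flops per cycle per core (4 lanes x 2 flops), 16 compute cores x 8 flops/cycle x 1.6 GHz = 204.8 GF/s.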
Blue Gene/Q chip architecture
[Chip diagram: 16+2 PPC/FPU cores, each with an L1 prefetcher, connected through a full crossbar switch to 16 x 2 MB L2 slices; two DDR-3 controllers to external DDR3; DMA engine; chip I/O shares function with PCI_Express.]
Scalable atomic operations
[Diagram: cores with L1 prefetchers issue a fetch_and_inc through the crossbar switch to the 2 MB L2 slices, the request fanning out in steps 1, 1.1, 1.2, 1.3.]
- Scalable atomic operations executed in the L2 (fetch_and_inc, for example, used to build a queuing lock)
- Cost: 1 round trip + 4N L2 cycles, where N is the number of threads
- For N = 64 and a 75-cycle L2 round trip: 75 + 64 x 4 = 331 cycles
- Compared to 9600 cycles for a standard implementation
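A minimal sketch of the queuing (ticket) lock that fetch_and_inc enables, written with portable C11 atomics. On Blue Gene/Q the increment would be performed by the in-L2 atomic unit described above; the portable version shown here pays the ordinary coherence cost instead.

#include <stdatomic.h>

/* Ticket (queuing) lock built on fetch_and_inc.
 * Initialize both counters to zero before use. */
typedef struct {
    atomic_uint next_ticket;   /* fetch_and_inc target */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    /* One fetch_and_inc hands each thread a unique ticket... */
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);
    /* ...then each thread spins until its number comes up. */
    while (atomic_load(&l->now_serving) != me)
        ;                      /* spin */
}

static void ticket_unlock(ticket_lock_t *l)
{
    atomic_fetch_add(&l->now_serving, 1);
}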