Scalable Algorithms for Massive-scale Graphs
KAUST, SC11, November 2011
Fabrizio Petrini, IBM TJ Watson
© 2011 IBM Corporation
LinkedIn Network
Data in the real world: Large Complex Networks
Domains: Social Networks, Internet, Biological, Finance, Transportation
- Internet: nodes are the routers, edges are the connections. Finding the network topology; solving bandwidth, flow, and shortest-path problems. (1.8 billion Internet users and growing)
- Network security: large graphs that can be used to dynamically track network behavior and identify potential attacks
Data source: DIMACS, Wikipedia
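The shortest-path and flow problems above all start from a graph traversal; below is a minimal sketch of the level-synchronous BFS kernel that Graph500 (discussed later in this deck) benchmarks. The CSR layout and names (off, adj, parent) are illustrative, not taken from the talk.

#include <stdint.h>
#include <stdlib.h>

/* Level-synchronous BFS over a CSR graph: off[u]..off[u+1] indexes
 * u's neighbors in adj.  parent[v] doubles as the visited flag. */
void bfs(int64_t nv, const int64_t *off, const int64_t *adj,
         int64_t root, int64_t *parent)
{
    int64_t *frontier = malloc(nv * sizeof *frontier);
    int64_t *next     = malloc(nv * sizeof *next);
    for (int64_t v = 0; v < nv; v++) parent[v] = -1;
    parent[root] = root;
    frontier[0]  = root;
    for (int64_t nf = 1; nf > 0; ) {
        int64_t nn = 0;
        for (int64_t i = 0; i < nf; i++) {            /* expand frontier */
            int64_t u = frontier[i];
            for (int64_t e = off[u]; e < off[u + 1]; e++) {
                int64_t v = adj[e];
                if (parent[v] == -1) {                /* first visit */
                    parent[v] = u;
                    next[nn++] = v;
                }
            }
        }
        int64_t *t = frontier; frontier = next; next = t;  /* swap levels */
        nf = nn;
    }
    free(frontier);
    free(next);
}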
Data in the real world: Large Complex Networks
Domains: Social Networks, Internet, Biological, Finance, Transportation
- Social networks: nodes represent people, joined through personal or professional acquaintance, or through groups and communities.
- Facebook: grew from 100 to 150 million users in 4 months; >400 million currently
- Twitter: 75 million users (Jan 2010); 6.2 million new users per month
Data source: Nielsen Report, The Inquirer
Blue Gene/Q
1. Chip: 16+2 PowerPC cores
2. Single-chip module
3. Compute card: one chip module, 8/16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards w/ 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
- Sustained single-node performance: 10x Blue Gene/P, 20x Blue Gene/L
- MF/Watt: 6x Blue Gene/P, 10x Blue Gene/L (~2 GF/W, Green 500 criteria)
- Software and hardware support for programming models that exploit node hardware concurrency
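A quick sanity check of the 20 PF/s system figure from the per-node peak given later (204.8 GF/s): 32 compute cards per node card x 16 node cards per midplane x 2 midplanes per rack = 1024 nodes per rack; 1024 x 204.8 GF/s ≈ 209.7 TF/s per rack; 96 racks x 209.7 TF/s ≈ 20.1 PF/s.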
Messaging performance (pclk = processor clocks; lower is better):
Message latency:                       SPI 530 pclk | PAMI 1150 pclk | MPI 1184 pclk
Message rate, single threaded
(pclk per message):                    SPI 134 pclk | PAMI 245 pclk  | MPI 779 pclk
Total MPI latency: 2864 pclk; total MPI overhead: 1172 pclk.
Our Graph500 implementation relies on SPI
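For context, message-latency numbers like these usually come from a ping-pong microbenchmark. Below is a generic MPI sketch; the SPI and PAMI measurements use IBM-specific interfaces not reproduced here, and the iteration count is an arbitrary choice.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char byte = 0;
    const int iters = 100000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {            /* ping */
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* pong */
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)                  /* one-way latency = half the round trip */
        printf("latency: %.3f us\n", dt / iters / 2 * 1e6);
    MPI_Finalize();
    return 0;
}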
Blue Gene/Q compute chip
- 16+1 core SMP; each core 4-way hardware threaded
- Transactional memory and thread-level speculation
- Quad floating-point unit on each core
- 204.8 GF peak per node at 1.6 GHz
- 32 MB shared L2 cache; 563 GB/s bisection bandwidth to the shared L2 (Blue Gene/L at LLNL has 700 GB/s for the entire system)
- 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3; 2 channels, each with chip-kill protection)
- 16 GB memory per node
- Network: 10 intra-rack and inter-rack interprocessor links (5D torus), each at 2.0 GB/s; one 2.0 GB/s I/O link to the I/O subsystem
- 55 W chip power
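The 204.8 GF peak follows from the quad FPU: assuming fused multiply-add, i.e., 8 flops per cycle per core (4 lanes x 2 flops), 16 compute cores x 8 flops/cycle x 1.6 GHz = 204.8 GF/s.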
Blue Gene/Q chip architecture
[Chip diagram: 16+2 PPC/FPU cores, each with an L1 prefetcher, connected through a full crossbar switch to 16 x 2 MB L2 slices; two DDR-3 controllers to external DDR3; DMA engine; chip I/O shares function with PCI_Express.]
Scalable atomic operations
[Diagram: cores with L1 prefetchers issue a fetch_and_inc through the crossbar switch to the 2 MB L2 slices, the request fanning out in steps 1, 1.1, 1.2, 1.3.]
- Scalable atomic operations executed in the L2 (fetch_and_inc, for example, used to build a queuing lock)
- Cost: 1 round trip + 4N L2 cycles, where N is the number of threads
- For N = 64 and a 75-cycle L2 round trip: 75 + 64 x 4 = 331 cycles
- Compared to 9600 cycles for a standard implementation
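A minimal sketch of the queuing (ticket) lock that fetch_and_inc enables, written with portable C11 atomics. On Blue Gene/Q the increment would be performed by the in-L2 atomic unit described above; the portable version shown here pays the ordinary coherence cost instead.

#include <stdatomic.h>

/* Ticket (queuing) lock built on fetch_and_inc.
 * Initialize both counters to zero before use. */
typedef struct {
    atomic_uint next_ticket;   /* fetch_and_inc target */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    /* One fetch_and_inc hands each thread a unique ticket... */
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);
    /* ...then each thread spins until its number comes up. */
    while (atomic_load(&l->now_serving) != me)
        ;                      /* spin */
}

static void ticket_unlock(ticket_lock_t *l)
{
    atomic_fetch_add(&l->now_serving, 1);
}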