QPACE: QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Heiko J Schick – IBM Deutschland R&D GmbH
November 2010
© 2009 IBM Corporation
Agenda
Chapter 1: Overview
Chapter 2: Application optimized supercomputers
Chapter 3: QPACE
Chapter 4: Review and Summary
Chapter 5: Unforgettable Impressions ;-)
Building Blocks of Matter
QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Quarks are the constituents of matter; they interact strongly by exchanging gluons.
Particular phenomena
– Confinement
– Asymptotic freedom (Nobel Prize 2004)
Theory of strong interactions = Quantum Chromodynamics (QCD)
Chapter 1: Overview
Computing Resource Requests
The lattice QCD community aims for O(1-3) PFlop/s sustained beyond 2010.
Europe
– “The computational requirements voiced by these European groups sum up to more than 1 sustained Petaflop/s by 2009.” [HPC in Europe Taskforce (HET), 2006]
US (USQCD)
– Hope for O(1) PFlop/s sustained in 2010-11. “A goal with very substantial scientific rewards.” [USQCD SciDAC-2 proposal, 2006]
Similar requests from Japan.
Chapter 1: Overview
Performance Critical Kernels
Overall performance of lattice QCD simulations is dominated by a few kernels:
– Linear algebra
  • Single-processor operations
  • Typically memory-bandwidth limited
– Global reductions
  • Typically limited by network latency (e.g. of a d-dimensional torus network)
– Sparse matrix-vector multiplication
Chapter 2: Application optimized supercomputers
Relevant Performance Signatures
Arithmetic operations
– Floating-point arithmetic with complex operands
– Dominant operation: a × b + c
Memory operations
– High data re-use
– Access patterns:
  • Random, small blocks (optimize for cache)
  • 3 streams, large blocks (vector-like architectures)
Flow control
– Simple / predictable
Chapter 2: Application optimized supercomputers
Parallelization
Parallelization strategy
– Spatial domain decomposition: the simulation domain is partitioned into small 3d sub-domains, and one sub-domain is assigned to each processor (see the sketch after this list).
Nearest neighbour communication
– 3-4 dimensional torus
Homogeneous communication patterns
Large bandwidth
Access pattern
– Medium-size messages = O(10) kBytes (large local problem size)
– Small messages = O(0.1) kBytes (small local problem size)
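As a minimal illustration of this decomposition, here is a sketch using standard MPI topology routines rather than the QPACE torus interface; the choice of a 3d periodic process grid is the only assumption:

#include <mpi.h>
#include <stdio.h>

/* Sketch of a 3d spatial domain decomposition with nearest-neighbour links,
 * using an MPI Cartesian communicator.  Periodic boundaries give the torus. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Dims_create(nprocs, 3, dims);            /* factor nprocs into a 3d grid */

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

    int rank;
    MPI_Comm_rank(torus, &rank);

    /* Each rank owns one sub-domain; the boundary exchange only involves the
     * two nearest neighbours in each of the three directions.               */
    for (int dir = 0; dir < 3; dir++) {
        int lo, hi;
        MPI_Cart_shift(torus, dir, 1, &lo, &hi);
        printf("rank %d, dir %d: neighbours %d and %d\n", rank, dir, lo, hi);
    }

    MPI_Finalize();
    return 0;
}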
Chapter 2: Application optimized supercomputers
Performance Signature: caxpy
Task: multiply a vector x by a scalar a, add to a vector y, and store the result in y:
    y_i ← a · x_i + y_i
where a is a complex scalar and the x_i, y_i are complex 3x4 matrices.
Operations per i: 12 complex multiply-adds = 96 FLOPS
Information transfer between storage and register file (front-end to processing device):
– Load: x_i and y_i = 48 8-byte words
– Store: y_i = 24 8-byte words
Balance: 96 / (48 + 24) ≈ 1.3 FLOPS / word
[Diagram: data transfer between memory (M) and register file (RF)]
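For illustration, a minimal scalar C sketch of this kernel; the data layout is an assumption, and production code on the Cell/B.E. SPEs would use SIMD intrinsics and shuffles instead:

#include <complex.h>
#include <stddef.h>

/* caxpy over a lattice spinor field: y[i] <- a * x[i] + y[i], where each
 * site i carries a complex 3x4 matrix, i.e. 12 complex (24 real) numbers. */
void caxpy(double complex a,
           const double complex *x,   /* n * 12 complex elements */
           double complex *y,         /* n * 12 complex elements */
           size_t n)
{
    for (size_t k = 0; k < n * 12; k++) {
        /* one complex multiply-add = 8 real FLOPS, so 12 * 8 = 96 FLOPS/site */
        y[k] = a * x[k] + y[k];
    }
}
/* Per site: load 12 complex from x and 12 from y -> 48 8-byte words,
 * store 12 complex to y                          -> 24 8-byte words,
 * balance = 96 / (48 + 24) ~ 1.33 FLOPS per word.                     */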
Chapter 2: Application optimized supercomputers
Sustained Performance
Bandwidth / throughput of a device: β
Time needed to execute task i:
    T_i = I_i / β + λ
where I_i is the amount of processed data and λ is the latency.
Efficiency:
    ε = T_ideal / T_real
– "Ideal" execution time T_ideal
– "Real" execution time T_real
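A toy evaluation of this model in C, using the caxpy numbers from the previous slide; the device parameters are illustrative assumptions, not measured QPACE values:

#include <stdio.h>

/* T = I / beta + lambda for each device; sustained efficiency is the
 * ideal (compute-only) time divided by the real (slowest-device) time. */
static double exec_time(double items, double beta, double lambda)
{
    return items / beta + lambda;
}

int main(void)
{
    const double flops = 96.0, words = 72.0;  /* caxpy, per lattice site */

    const double beta_fpu = 32.0;  /* FLOPS per cycle (e.g. 8 SPEs x 4)  */
    const double beta_mem = 1.0;   /* words per cycle from main memory   */

    /* Latency neglected here; it amortizes over long streaming accesses. */
    double t_ideal = exec_time(flops, beta_fpu, 0.0);
    double t_mem   = exec_time(words, beta_mem, 0.0);
    double t_real  = (t_mem > t_ideal) ? t_mem : t_ideal;

    printf("T_ideal = %.1f, T_real = %.1f cycles -> efficiency = %.1f %%\n",
           t_ideal, t_real, 100.0 * t_ideal / t_real);
    return 0;
}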
Chapter 2: Application optimized supercomputers
Relevant Hardware Characteristics
Floating-point unit throughput
– Caveat: processor instruction set matching
  • No support for complex arithmetic (e.g. Cell/B.E.)
  • Additional shuffle operations needed
Memory bandwidth
– Multi-level memory hierarchy
  • External memory
  • Cache
  • Register file
Chapter 2: Application optimized supercomputers
Balanced Hardware
Example caxpy:
Processor         FPU throughput     Memory bandwidth    Balance
                  [FLOPS / cycle]    [words / cycle]     [FLOPS / word]
apeNEXT           8                  2                   4
QCDOC (MM)        2                  0.63                3.2
QCDOC (LS)        2                  2                   1
Xeon              2                  0.29                7
GPU               128 x 2            17.3 (*)            14.8
Cell/B.E. (MM)    8 x 4              1                   32
Cell/B.E. (LS)    8 x 4              8 x 4               2
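The same comparison can be done programmatically; a small C sketch using the table values above (the efficiency bound assumes a purely memory-bound kernel):

#include <stdio.h>

/* Machine balance (peak FLOPS per transferred word) versus the
 * ~1.33 FLOPS/word that caxpy offers.                            */
typedef struct {
    const char *name;
    double flops_per_cycle;   /* peak FPU throughput        */
    double words_per_cycle;   /* memory (or LS) bandwidth   */
} machine_t;

int main(void)
{
    const double caxpy_balance = 96.0 / 72.0;   /* ~1.33 FLOPS / word */

    const machine_t m[] = {
        { "apeNEXT",        8.0,      2.0  },
        { "QCDOC (MM)",     2.0,      0.63 },
        { "Xeon",           2.0,      0.29 },
        { "Cell/B.E. (MM)", 8 * 4.0,  1.0  },
    };

    for (int i = 0; i < 4; i++) {
        double balance = m[i].flops_per_cycle / m[i].words_per_cycle;
        double eff = caxpy_balance / balance;
        if (eff > 1.0) eff = 1.0;   /* compute-bound: capped at 100 % */
        printf("%-15s balance %5.1f FLOPS/word -> caxpy efficiency <= %5.1f %%\n",
               m[i].name, balance, 100.0 * eff);
    }
    return 0;
}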
Chapter 2: Application optimized supercomputers
Cell/B.E. Architecture
Chapter 2: Application optimized supercomputers
Balanced Systems ?!?
Chapter 2: Application optimized supercomputers
… but are they Reliable, Available and Serviceable ?!?
Chapter 2: Application optimized supercomputers
Collaboration and Credits
QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Academic Partners
– University Regensburg: S. Heybrock, D. Hierl, T. Maurer, N. Meyer, A. Nobile, A. Schaefer, S. Solbrig, T. Streuer, T. Wettig
– University Wuppertal: Z. Fodor, A. Frommer, M. Huesken
– University Ferrara: M. Pivanti, F. Schifano, R. Tripiccione
– University Milano: H. Simma
– DESY Zeuthen: D. Pleiter, K.-H. Sulanke, F. Winter
– Research Lab Juelich: M. Drochner, N. Eicker, T. Lippert
Industrial Partner
– IBM (DE, US, FR): H. Baier, H. Boettiger, A. Castellane, J.-F. Fauh, U. Fischer, G. Goldrian, C. Gomez, T. Huth, B. Krill, J. Lauritsen, J. McFadden, I. Ouda, M. Ries, H.J. Schick, J.-S. Vogt
Main Funding
– DFG (SFB TR55), IBM
Support by Others
– Eurotech (IT), Knuerr (DE), Xilinx (US)
Chapter 3: QPACE
Project Timetable
01/08 Official project start
06/08 Node card bring-up
10/08 Fully populated backplane
01/09 Hardware integration tests
02-03/09 Release to manufacturing
05/09 Integration of 1st rack
07/09 Deployment of 2 racks at JSC
08/09 Deployment of 4 racks at JSC and 4 racks at University Wuppertal complete
Production Chain
Major steps
– Pre-integration at University Regensburg
– Integration at IBM / Boeblingen
– Installation at FZ Juelich and University Wuppertal
Concept
System
– Node card with IBM® PowerXCell™ 8i processor and network processor (NWP)
  • Important feature: fast double-precision arithmetic
– Commodity processor interconnected by a custom network
– Custom system design
– Liquid cooling system
Rack parameters
– 256 node cards
  • 26 TFLOPS peak (double precision)
  • 1 TB memory
– O(35) kWatt power consumption
Applications
– Target sustained performance of 20-30%
– Optimized for calculations in theoretical particle physics: simulation of Quantum Chromodynamics
Chapter 3: QPACE
Networks
Torus network
– Nearest-neighbor communication, 3-dimensional torus topology
– Aggregate bandwidth 6 GByte/s per node and direction
– Remote DMA communication (local store to local store)
Interrupt tree network
– Evaluation of global conditions and synchronization
– Global exceptions
– 2 signals per direction
Ethernet network
– 1 Gigabit Ethernet link per node card to rack-level switches (switched network)
– I/O to parallel file system (user input / output)
– Linux network boot
– Aim of O(10) GB/s aggregate bandwidth per rack
Chapter 3: QPACE
Rack
– Backplane (8 per rack)
– Power supply and power adapter card (24 per rack)
– Root card (16 per rack)
– Node card (256 per rack)
Chapter 3: QPACE
Node Card
Components
– IBM PowerXCell 8i processor, 3.2 GHz
– 4 GByte DDR2 memory, 800 MHz, with ECC
– Network processor (NWP): Xilinx LX110T FPGA
– Ethernet PHY
– 6 x 1 GB/s external links using the PCI Express physical layer
– Service processor (SP): Freescale MCF52211
– Flash (firmware and FPGA configuration)
– Power subsystem
– Clocking
Network processor
– FlexIO interface to the PowerXCell 8i processor, 2 bytes wide at 3 GHz bit rate
– Gigabit Ethernet
– UART: firmware / Linux console
– UART: SP communication
– SPI master (boot flash)
– SPI slave for training and configuration
– GPIO
Chapter 3: QPACE
Node Card
[Photo / block diagram: memory, PowerXCell 8i processor, network processor (FPGA), network PHYs]
Chapter 3: QPACE
Node Card
[Block diagram labels: PowerXCell 8i, Virtex-5 FPGA, 4 x DDR2 at 800 MHz, FlexIO 6 GB/s, power subsystem, clocking, flash, UART, SPI, I2C, Freescale MCF52211 service processor (RS232), Gigabit Ethernet PHY, 6 x 1 GB/s compute-network PHYs, debug interface; 4*8*2*6 = 384 I/O pins @ 250 MHz of 680 available on the LX110T]
Chapter 3: QPACE
Network Processor
Chapter 3: QPACE
[Block diagram: FlexIO interface to the processor; network logic with routing, arbitration, FIFOs and configuration; six torus link interfaces with PHYs (x+, x-, ..., z-); Ethernet interface; global signals; serial interfaces; SPI flash]
FPGA resource utilization
Slices          92 %
Pins            86 %
LUT-FF pairs    73 %
Flip-Flops      55 %
LUTs            53 %
BRAM / FIFOs    35 %

Utilization by block    Flip-Flops   LUTs
Processor Interface     53 %         46 %
Torus                   36 %         39 %
Ethernet                 4 %          2 %
Network Processor
[Block diagram: FlexIO / RocketIO interface, GBIF, IOC (IOIF) master and slave, switch / address decoder / FIFOs / bus controller handling send and receive requests, 6 x 1 GB/s links]
Logic provided by IBM:
– RocketIO logic
– IOC logic
– GBIF logic
Logic provided by the academic partners:
– Network processor logic
Chapter 3: QPACE
Processor Bus Interface
FlexIO interface
– High-bandwidth interface between the IBM PowerXCell 8i processor and the Xilinx Virtex-5 FPGA
– Implementation from Rambus Inc.
– Optimized for intra-board environments
– Uses RocketIO GTP transceiver features
– Requires link training after power-on
  • Phase calibration (aligns the data for the optimal sampling point)
  • Parallel calibration (synchronizes the receive deserializer with the transmit serializer)
  • Levelization calibration (aligns all data lanes)
Challenges
– Speed, latency, bandwidth and timing (clock)
– 3 GByte/s communication channel
– 2-byte wide link
Chapter 3: QPACE
Torus Network Physical Layer
Physical layer
– 10GbE @ 2.5 GHz → 1 GByte/s
Eye diagram for a bad-case link
– 3.125 GHz
– 40 cm PCB, 50 cm cable
– 1 PCB-PCB, 2 PCB-cable connectors
Custom data link layer
– Fixed-size messages
– 128 Byte payload + 4 Byte header + 4 Byte CRC
→ Minimal protocol overhead
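As an illustration, a possible C view of such a fixed-size datagram; the field order is an assumption, not the actual QPACE link-layer encoding:

#include <stdint.h>

/* One torus datagram: 4-byte header + 128-byte payload + 4-byte CRC,
 * i.e. 136 bytes on the wire, so the protocol overhead is 8/136 ~ 6 %. */
struct torus_datagram {
    uint32_t header;         /* virtual channel, tags, flow control */
    uint8_t  payload[128];   /* fixed-size user data                */
    uint32_t crc;            /* link-level error detection          */
};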
Chapter 3: QPACE
Torus Network Architecture
2-sided communication (a sketch of the flow follows below)
– Node A initiates send, node B initiates receive
– Send and receive commands have to match
– Multiple use of the same link via virtual channels
Send / receive from / to local store or main memory
– CPU → NWP
  • CPU moves data and control info to the NWP
  • Back-pressure controlled
– NWP → NWP
  • Independent of the processor
  • Each datagram has to be acknowledged
– NWP → CPU
  • CPU provides credits to the NWP
  • NWP writes data into the processor
  • Completion indicated by notification
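The flow can be summarized in hedged pseudocode; all nwp_* functions below are hypothetical names used only to illustrate the sequence, not the real QPACE programming interface:

#include <stddef.h>

/* Hypothetical NWP interface, declared only for illustration. */
void nwp_post_send(int link, const void *buf, size_t bytes);
void nwp_post_credits(int link, void *buf, size_t bytes);
void nwp_wait_notification(int link);

void send_block(int link, const void *buf, size_t bytes)
{
    /* CPU -> NWP: the sender moves data and control information to the
     * network processor; the NWP applies back-pressure when FIFOs fill. */
    nwp_post_send(link, buf, bytes);
    /* NWP -> NWP: datagrams then travel independently of the processor,
     * and each one is acknowledged on the link level.                    */
}

void recv_block(int link, void *buf, size_t bytes)
{
    /* NWP -> CPU: the receiver first hands credits (buffer space) to the
     * NWP, which writes incoming data directly into the processor.       */
    nwp_post_credits(link, buf, bytes);
    /* Completion of the matching send is indicated by a notification.    */
    nwp_wait_notification(link);
}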
Torus Network Reconfiguration
Torus network PHYs provide 2 interfaces
– Used for network reconfiguration by selecting the primary or secondary interface
Example
– 1x8 or 2x4 node cards
Partition sizes: (1,2,2N) * (1,2,4,8,16) * (1,2,4,8), enumerated in the sketch below
– N ... number of racks connected via cables
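A short C sketch enumerating these partition sizes, for an assumed N = 4 racks connected via cables:

#include <stdio.h>

/* Enumerate the (1,2,2N) x (1,2,4,8,16) x (1,2,4,8) partition sizes. */
int main(void)
{
    const int N = 4;                      /* example number of racks */
    const int x[] = {1, 2, 2 * N};
    const int y[] = {1, 2, 4, 8, 16};
    const int z[] = {1, 2, 4, 8};

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 5; j++)
            for (int k = 0; k < 4; k++)
                printf("%2d x %2d x %2d = %4d node cards\n",
                       x[i], y[j], z[k], x[i] * y[j] * z[k]);
    return 0;
}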
Chapter 3: QPACE
Cooling
Concept
– Node card mounted in a housing that acts as heat conductor
– Housing connected to a liquid-cooled cold plate
– Critical thermal interfaces
  • Processor – thermal box
  • Thermal box – cold plate
– Dry connection between node card and cooling circuit
Node card housing
– Closed node card housing acts as heat conductor
– Heat conductor is linked with the liquid-cooled “cold plate”
– Cold plate is placed between two rows of node cards
Simulation results for one cold plate
– Ambient 12 °C
– Water 10 L / min
– Load 4224 Watt (2112 Watt per side)
Chapter 3: QPACE
Power Efficiency
Chapter 3: QPACE
Project Review
Hardware design
– Almost all critical problems solved in time
– Network processor implementation still a challenge
– No serious problems due to wrong design decisions
Hardware status
– Manufacturing quality good: small bone pile, few defects during operation
Time schedule
– Essentially stayed within the planned schedule
– Implementation of system / application software delayed
Chapter 4: Review and Summary
Summary
QPACE is a new, scalable LQCD machine based on the PowerXCell 8i processor.
Design highlights
– FPGA directly attached to the processor
– LQCD-optimized, low-latency torus network
– Novel, cost-efficient liquid cooling system
– High packaging density
– Very power-efficient architecture
O(20-30%) sustained performance for key LQCD kernels is reached / feasible
→ O(10-16) TFLOPS / rack (SP)
Chapter 4: Review and Summary
Chapter 5: Unforgettable Impressions ;-)
Thank you very much for your attention.
Disclaimer
IBM®, DB2®, MVS/ESA, AIX®, S/390®, AS/400®, OS/390®, OS/400®, iSeries, pSeries, xSeries, zSeries, z/OS, AFP, Intelligent Miner, WebSphere®, Netfinity®, Tivoli®, Informix and Informix® Dynamic Server™, IBM, BladeCenter and POWER and others are trademarks of the IBM Corporation in the US and/or other countries.
Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others. The information and materials are provided on an "as is" basis and are subject to change.