Page 1: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


QPACE
QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)

Heiko J Schick – IBM Deutschland R&D GmbH

November 2010

Page 2: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Agenda

Chapter 1: Overview

Chapter 2: Application optimized supercomputers

Chapter 3: QPACE

Chapter 4: Review and Summary

Chapter 5: Unforgettable Impressions ;-)

Page 3: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Building Blocks of Matter

QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)

Quarks are the constituents of matter; they interact strongly by exchanging gluons.

Particular phenomena
– Confinement
– Asymptotic freedom (Nobel Prize 2004)

Theory of strong interactions = Quantum Chromodynamics (QCD)

Chapter 1: Overview

Page 4: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Computing Resource Requests

Lattice QCD community aims for O(1-3) PFlop/s sustained beyond 2010.

Europe
– “The computational requirements voiced by these European groups sum up to more than 1 sustained Petaflop/s by 2009.” [HPC in Europe Taskforce (HET), 2006]

US (USQCD)
– Hope for O(1) PFlop/s sustained in 2010-11. “A goal with very substantial scientific rewards.” [USQCD SciDAC-2 proposal, 2006]

Similar requests from Japan.

Chapter 1: Overview

Page 5: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Performance Critical Kernels

Overall performance of lattice QCD simulations is dominated by a few kernels:

– Linear algebra
• Single-processor operations
• Typically memory-bandwidth limited

– Global reductions
• Typically limited by network latency of the d-dimensional torus network (see the estimate below)

– Sparse matrix-vector multiplication
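As a rough estimate (standard torus scaling argument, not from the slide): a global reduction on a d-dimensional torus with N nodes traverses O(d · N^(1/d)) links, so its time T_global ≈ d · N^(1/d) · τ_link is dominated by the per-hop latency τ_link rather than by bandwidth.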

Chapter 2: Application optimized supercomputers

Page 6: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Relevant Performance Signatures

Arithmetic operations
– Floating-point arithmetic with complex operands
– Dominant operation: a × b + c (see the sketch after this list)

Memory operations
– High data re-use
– Access patterns:
• Random, small blocks (optimize for cache)
• 3 streams, large blocks (vector-like architectures)

Flow control
– Simple / predictable
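To make the arithmetic signature concrete, a minimal C sketch (my own illustration, not project code) of the dominant complex multiply-add, showing why one complex a × b + c costs 8 real FLOPS:

    #include <stdio.h>

    /* One complex multiply-add d = a*b + c written out in real arithmetic:
     * 4 multiplies + 4 adds = 8 real FLOPS per complex madd. On SIMD units
     * without complex support (e.g. the Cell/B.E. SPU), the real/imaginary
     * interleaving costs additional shuffle operations on top of this. */
    typedef struct { double re, im; } cplx;

    static cplx cmadd(cplx a, cplx b, cplx c)
    {
        cplx d;
        d.re = a.re * b.re - a.im * b.im + c.re;  /* 2 mul + 2 add */
        d.im = a.re * b.im + a.im * b.re + c.im;  /* 2 mul + 2 add */
        return d;
    }

    int main(void)
    {
        cplx a = {1, 2}, b = {3, 4}, c = {5, 6};
        cplx d = cmadd(a, b, c);
        printf("d = %g + %gi\n", d.re, d.im);  /* (1+2i)(3+4i)+(5+6i) = 0+16i */
        return 0;
    }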

Chapter 2: Application optimized supercomputers

Page 7: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Parallelization

Parallelization strategy
– Spatial domain decomposition partitions the simulation volume into small 3d sub-domains; one sub-domain is assigned to each processor.

Nearest-neighbour communication
– 3-4 dimensional torus

Homogeneous communication patterns

Large bandwidth

Access pattern
– Medium-size messages = O(10) kBytes (large local problem size)
– Small messages = O(0.1) kBytes (small local problem size)
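For illustration, these message sizes follow directly from the sub-domain geometry. A small C sketch, assuming spin-projected half-spinors of 12 single-precision words per boundary site (the field layout is my assumption, not from the talk):

    #include <stdio.h>

    /* Hypothetical halo-exchange message size for one face of a local L^4
     * sub-lattice, assuming spin-projected half-spinors (12 real numbers
     * per boundary site) in single precision. Illustrative only. */
    int main(void)
    {
        const int words_per_site = 12;   /* half-spinor: 2 spins x 3 colors x complex */
        const int bytes_per_word = 4;    /* single precision */
        for (int L = 2; L <= 16; L *= 2) {
            long face_sites = (long)L * L * L;  /* one 3d face of the L^4 domain */
            long bytes = face_sites * words_per_site * bytes_per_word;
            printf("L=%2d  face message = %6.1f kBytes\n", L, bytes / 1024.0);
        }
        return 0;   /* L=8 gives ~24 kBytes, L=2 gives ~0.4 kBytes */
    }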

Chapter 2: Application optimized supercomputers

Page 8: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Performance Signature: caxpy

Task: multiply a vector X by a scalar, add to a vector Y, and store in the vector Y:

y_i = a · x_i + y_i, where a is a complex scalar and x_i, y_i are complex 3x4 matrices

Operations per i: 12 complex multiply-adds = 96 FLOPS

Information transfer between storage and register file (front-end to processing device):

– Load: x_i and y_i = 48 8-byte words

– Store: y_i = 24 8-byte words

Balance: 96 / 72 = 1.3 FLOPS / word

[Figure: data path between memory (M) and register file (RF)]
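The kernel itself is tiny; a plain scalar C sketch (my illustration, not QPACE code) that realizes exactly the operation and traffic counted above:

    #include <stdio.h>

    /* caxpy sketch: y_i = a * x_i + y_i with a a complex scalar and x_i, y_i
     * complex 3x4 matrices, i.e. 12 complex numbers per site.
     * Per site: 12 complex multiply-adds = 96 real FLOPS.
     * Traffic:  load x_i, y_i = 48 8-byte words, store y_i = 24 words,
     * hence the balance 96 / 72 = 1.33 FLOPS per word. */
    typedef struct { double re, im; } cplx;

    static void caxpy(cplx a, const cplx *x, cplx *y, long nsites)
    {
        for (long k = 0; k < 12 * nsites; ++k) {  /* 12 complex entries per site */
            double re = a.re * x[k].re - a.im * x[k].im + y[k].re;
            double im = a.re * x[k].im + a.im * x[k].re + y[k].im;
            y[k].re = re;
            y[k].im = im;
        }
    }

    int main(void)
    {
        cplx a = {2, 0}, x[12] = {{1, 1}}, y[12] = {{0, 0}};
        caxpy(a, x, y, 1);                               /* one site */
        printf("y[0] = %g + %gi\n", y[0].re, y[0].im);   /* 2 + 2i   */
        return 0;
    }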

Chapter 2: Application optimized supercomputers

Page 9: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Sustained Performance

Bandwidth / throughput of device i: β_i

Time needed to execute task i: T_i = I_i / β_i + λ_i

where I_i = amount of processed data and λ_i = latency

Efficiency: ε = T_ideal / T_real

– “Ideal” execution time T_ideal

– “Real” execution time T_real
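To make the model concrete, a minimal C sketch that evaluates T_i = I_i / β_i + λ_i for the caxpy signature of the previous slide on Cell/B.E.-like parameters (λ_i set to zero and perfect device overlap are simplifying assumptions; the numbers are illustrative, not measurements):

    #include <stdio.h>

    /* Toy evaluation of T_i = I_i / beta_i + lambda_i for the caxpy kernel.
     * beta values follow the balance table on the next slide; zero latency
     * and perfect overlap of the devices are simplifying assumptions. */
    int main(void)
    {
        double flops    = 96.0;   /* FLOPS per lattice site        */
        double words    = 72.0;   /* 48 loads + 24 stores per site */
        double beta_fpu = 32.0;   /* FLOPS/cycle: 8 SPEs x 4       */
        double beta_mem = 1.0;    /* words/cycle from main memory  */
        double lambda   = 0.0;    /* latency, neglected here       */

        double t_fpu  = flops / beta_fpu + lambda;      /* "ideal" time   */
        double t_mem  = words / beta_mem + lambda;
        double t_real = t_mem > t_fpu ? t_mem : t_fpu;  /* slowest device */

        printf("efficiency = %.1f %%\n", 100.0 * t_fpu / t_real);  /* ~4.2 % */
        return 0;
    }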

Chapter 2: Application optimized supercomputers

Page 10: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Relevant Hardware Characteristics

Floating point unit throughput:

– Caveat: processor instruction set matching
• No support for complex arithmetic (e.g. Cell/B.E.)
• Additional shuffle operations needed

Memory bandwidth:

– Multi-level memory hierarchy
• External memory
• Cache
• Register file

Chapter 2: Application optimized supercomputers

Page 11: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Balanced Hardware

Example caxpy (kernel balance: 1.3 FLOPS / word):

Processor       FPU throughput    Memory bandwidth   Balance
                [FLOPS / cycle]   [words / cycle]    [FLOPS / word]
apeNEXT         8                 2                  4
QCDOC (MM)      2                 0.63               3.2
QCDOC (LS)      2                 2                  1
Xeon            2                 0.29               7
GPU             128 x 2           17.3 (*)           14.8
Cell/B.E. (MM)  8 x 4             1                  32
Cell/B.E. (LS)  8 x 4             8 x 4              2

(MM = main memory, LS = local store)
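For orientation, the Cell/B.E. main-memory row can be reproduced from public figures: 8 SPEs × 4 FLOPS/cycle = 32 FLOPS/cycle, against 25.6 GByte/s of memory bandwidth at 3.2 GHz = 8 bytes = one 8-byte word per cycle, i.e. a machine balance of 32 FLOPS/word. That is far above the 1.3 FLOPS/word that caxpy offers, so caxpy streamed from main memory is strongly bandwidth-bound on the Cell/B.E.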

Chapter 2: Application optimized supercomputers

Page 12: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Cell/B.E. Architecture

Chapter 2: Application optimized supercomputers

Page 13: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Balanced Systems ?!?


Chapter 2: Application optimized supercomputers

Page 14: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


… but are they Reliable, Available and Serviceable ?!?


Chapter 2: Application optimized supercomputers

Page 15: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Collaboration and Credits

QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)

Academic Partners
– University Regensburg: S. Heybrock, D. Hierl, T. Maurer, N. Meyer, A. Nobile, A. Schaefer, S. Solbrig, T. Streuer, T. Wettig
– University Wuppertal: Z. Fodor, A. Frommer, M. Huesken
– University Ferrara: M. Pivanti, F. Schifano, R. Tripiccione
– University Milano: H. Simma
– DESY Zeuthen: D. Pleiter, K.-H. Sulanke, F. Winter
– Research Lab Juelich: M. Drochner, N. Eicker, T. Lippert

Industrial Partner
– IBM (DE, US, FR): H. Baier, H. Boettiger, A. Castellane, J.-F. Fauh, U. Fischer, G. Goldrian, C. Gomez, T. Huth, B. Krill, J. Lauritsen, J. McFadden, I. Ouda, M. Ries, H.J. Schick, J.-S. Vogt

Main Funding
– DFG (SFB TR55), IBM

Support by Others
– Eurotech (IT), Knuerr (DE), Xilinx (US)

Chapter 3: QPACE

Page 16: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Project Timetable

01/08 Official project start

06/08 Node card bring-up

10/08 Fully populated backplane

01/09 Hardware integration tests

02-03/09 Release to manufacturing

05/09 Integration of 1st rack

07/09 Deployment of 2 racks at JSC

08/09 Deployment of 4 racks at JSC and 4 racks at University Wuppertal complete


Page 17: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Production Chain

Major steps
– Pre-integration at University Regensburg
– Integration at IBM / Boeblingen
– Installation at FZ Juelich and University Wuppertal


Page 18: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Concept

System
– Node card with IBM® PowerXCell™ 8i processor and network processor (NWP)
• Important feature: fast double-precision arithmetic
– Commodity processor interconnected by a custom network
– Custom system design
– Liquid cooling system

Rack parameters
– 256 node cards
• 26 TFLOPS peak (double precision)
• 1 TB memory
– O(35) kWatt power consumption

Applications
– Target sustained performance of 20-30%
– Optimized for calculations in theoretical particle physics: simulation of Quantum Chromodynamics
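As a cross-check of the rack numbers: one PowerXCell 8i delivers 8 SPEs × 4 double-precision FLOPS/cycle × 3.2 GHz ≈ 102.4 GFLOPS, so 256 node cards give ≈ 26 TFLOPS peak, and 256 × 4 GByte = 1 TByte of memory.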

Chapter 3: QPACE

Page 19: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Networks

Torus network
– Nearest-neighbor communication, 3-dimensional torus topology
– Aggregate bandwidth 6 GByte/s per node and direction
– Remote DMA communication (local store to local store)

Interrupt tree network
– Evaluation of global conditions and synchronization
– Global exceptions
– 2 signals per direction

Ethernet network
– 1 Gigabit Ethernet link per node card to rack-level switches (switched network)
– I/O to parallel file system (user input / output)
– Linux network boot
– Aim of O(10) GB/s aggregate bandwidth per rack

Chapter 3: QPACE

Page 20: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Rack components
– Backplane (8 per rack)
– Power supply and power adapter card (24 per rack)
– Root card (16 per rack)
– Node card (256 per rack)

Chapter 3: QPACE

Page 21: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Node Card

Components
– IBM PowerXCell 8i processor, 3.2 GHz
– 4 GByte DDR2 memory, 800 MHz, with ECC
– Network processor (NWP): Xilinx Virtex-5 LX110T FPGA
– Ethernet PHY
– 6 x 1 GB/s external links using PCI Express physical layer
– Service processor (SP): Freescale MCF52211
– Flash (firmware and FPGA configuration)
– Power subsystem
– Clocking

Network processor interfaces
– FlexIO interface to the PowerXCell 8i processor, 2 bytes wide at 3 GHz bit rate
– Gigabit Ethernet
– UART: firmware / Linux console
– UART: SP communication
– SPI master (boot flash)
– SPI slave for training and configuration
– GPIO

Chapter 3: QPACE

Page 22: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Node Card

[Photo: node card showing memory, the PowerXCell 8i processor, the network processor (FPGA) and the network PHYs]

Chapter 3: QPACE

Page 23: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Node Card

[Block diagram: PowerXCell 8i processor with four DDR2 channels (800 MHz) and FlexIO ports (6 GB/s each) to the Virtex-5 FPGA network processor; the FPGA drives the compute-network PHYs (6 x 1 GB/s, 4*8*2*6 = 384 I/Os at 250 MHz of 680 available on the LX110T), Gigabit Ethernet, UART/RS232 to the Freescale MCF52211 service processor, SPI flash, I2C, clocking, a debug interface and the power subsystem]

Chapter 3: QPACE

Page 24: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Network Processor


Chapter 3: QPACE

[Block diagram: two FlexIO interfaces connect to the network logic (routing, arbitration, FIFOs, configuration), which feeds six link interfaces and PHYs for the torus directions (x+, x-, ..., z-), plus an Ethernet interface, global signals, serial interfaces and SPI flash]

FPGA resource utilization:

Slices         92 %
Pins           86 %
LUT-FF pairs   73 %
Flip-Flops     55 %
LUTs           53 %
BRAM / FIFOs   35 %

Utilization by unit:

                     Flip-Flops   LUTs
Processor interface  53 %         46 %
Torus                36 %         39 %
Ethernet              4 %          2 %

Page 25: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Network Processor

[Block diagram: the FlexIO / RocketIO interface connects to the GBIF and IOC (IOIF) logic with master and slave ports; a switch / address decoder / FIFOs / bus controller dispatches send and receive requests; the torus links provide 6 x 1 GB/s]

Division of the logic design:

IBM:
• RocketIO logic
• IOC logic
• GBIF logic

Academic partners:
• Network processor logic

Chapter 3: QPACE

Page 26: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Processor Bus Interface

FlexIO Interface
– High-bandwidth interface between the IBM PowerXCell 8i processor and the Xilinx Virtex-5 FPGA
– Implementation from Rambus Inc.
– Optimized for intra-board environments
– Uses RocketIO GTP transceiver features
– Requires link training after power-on:

• Phase calibration (aligns the data for the optimal sampling point)

• Parallel calibration (synchronizes the receive deserializer with the transmit serializer)

• Levelization calibration (aligns all data lanes)

Challenges
– Speed, latency, bandwidth and timing (clock)
– 3 GByte/s communication channel
– 2 bytes link width

Chapter 3: QPACE

Page 27: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Torus Network Physical Layer

Physical layer
– 10GbE @ 2.5 GHz → 1 GByte/s

[Eye diagram for a bad-case link]
– 3.125 GHz
– 40 cm PCB, 50 cm cable
– 1 PCB-PCB, 2 PCB-cable connectors

Custom data link layer
– Fixed-size messages
– 128 byte payload + 4 byte header + 4 byte CRC
→ Minimal protocol overhead
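A minimal C sketch of such a frame layout (the internal header fields and the CRC coverage are my assumptions; the talk only specifies the three sizes):

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the fixed-size torus link-layer frame: 4-byte header +
     * 128-byte payload + 4-byte CRC = 136 bytes on the wire. */
    struct torus_frame {
        uint32_t header;        /* e.g. virtual channel / tag (assumed)  */
        uint8_t  payload[128];  /* fixed-size payload                    */
        uint32_t crc;           /* CRC over the frame (coverage assumed) */
    };

    int main(void)
    {
        printf("frame size = %zu bytes\n", sizeof(struct torus_frame));
        printf("overhead   = %.1f %%\n", 100.0 * 8 / sizeof(struct torus_frame));
        return 0;   /* 8 / 136 bytes ~ 5.9 % protocol overhead */
    }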


Chapter 3: QPACE

Page 28: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Torus Network Architecture

2-sided communication
– Node A initiates send, node B initiates receive
– Send and receive commands have to match
– Multiple use of the same link via virtual channels

Send / receive from / to local store or main memory
– CPU → NWP
• CPU moves data and control info to the NWP
• Back-pressure controlled

– NWP → NWP
• Independent of the processor
• Each datagram has to be acknowledged

– NWP → CPU
• CPU provides credits to the NWP
• NWP writes data into the processor
• Completion indicated by a notification
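A self-contained toy model of the credit handshake on the NWP → CPU path (a single-process illustration with invented names; the talk does not specify the software interface):

    #include <string.h>
    #include <stdio.h>

    /* Toy model of the 2-sided torus protocol on one virtual channel:
     * the receiver must grant a credit before the "NWP" may deliver a
     * datagram, and delivery completes with a notification. Not QPACE code. */

    enum { PAYLOAD = 128 };

    struct channel {
        int  credits;       /* receive buffers granted to the NWP */
        int  notified;      /* completion notification flag       */
        char buf[PAYLOAD];  /* the granted receive buffer         */
    };

    /* Receiver side: grant one credit (one 128-byte buffer). */
    static void grant_credit(struct channel *ch) { ch->credits = 1; }

    /* "NWP" side: deliver a datagram only if a credit is available. */
    static int deliver(struct channel *ch, const char *data)
    {
        if (ch->credits == 0)
            return -1;          /* back-pressure: sender must retry  */
        memcpy(ch->buf, data, PAYLOAD);
        ch->credits--;
        ch->notified = 1;       /* completion notification to the CPU */
        return 0;
    }

    int main(void)
    {
        struct channel ch = {0};
        char msg[PAYLOAD] = "hello, neighbour";

        if (deliver(&ch, msg) < 0)          /* no credit yet: rejected */
            puts("no credit: delivery back-pressured");
        grant_credit(&ch);                  /* receive posted          */
        if (deliver(&ch, msg) == 0 && ch.notified)
            printf("received: %s\n", ch.buf);
        return 0;
    }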


Page 29: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Torus Network Reconfiguration

Torus network PHYs provide 2 interfaces
– Used for network reconfiguration by selecting the primary or secondary interface

Example
– 1x8 or 2x4 node cards

Partition sizes: (1,2,2N) x (1,2,4,8,16) x (1,2,4,8)
– N = number of racks connected via cables
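A small C sketch that enumerates these admissible partition sizes (N = 4 racks is chosen here purely as an illustrative value):

    #include <stdio.h>

    /* Enumerate the admissible torus partition sizes quoted above:
     * x in {1, 2, 2N}, y in {1, 2, 4, 8, 16}, z in {1, 2, 4, 8},
     * where N is the number of racks connected via cables. */
    int main(void)
    {
        const int N = 4;                  /* example rack count */
        const int xs[] = {1, 2, 2 * N};
        const int ys[] = {1, 2, 4, 8, 16};
        const int zs[] = {1, 2, 4, 8};

        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 5; ++j)
                for (int k = 0; k < 4; ++k)
                    printf("%2d x %2d x %d = %4d node cards\n",
                           xs[i], ys[j], zs[k], xs[i] * ys[j] * zs[k]);
        return 0;
    }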


Chapter 3: QPACE

Page 30: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Cooling

Concept
– Node card mounted in housing = heat conductor
– Housing connected to liquid-cooled cold plate
– Critical thermal interfaces:
• Processor – thermal box
• Thermal box – cold plate
– Dry connection between node card and cooling circuit

Node card housing
– Closed node card housing acts as heat conductor
– Heat conductor is linked to the liquid-cooled “cold plate”
– Cold plate is placed between two rows of node cards

Simulation results for one cold plate
– Ambient 12°C
– Water 10 L / min
– Load 4224 Watt (2112 Watt per side)


Chapter 3: QPACE

Page 31: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Power Efficiency


Chapter 3: QPACE

Page 32: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Project Review

Hardware design
– Almost all critical problems solved in time
– Network processor implementation still a challenge
– No serious problems due to wrong design decisions

Hardware status
– Manufacturing quality good: small bone pile, few defects during operation

Time schedule
– Essentially stayed within the planned schedule
– Implementation of system / application software delayed


Chapter 4: Review and Summary

Page 33: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Summary

QPACE is a new, scalable LQCD machine based on the PowerXCell 8i processor.

Design highlights
– FPGA directly attached to processor
– LQCD-optimized, low-latency torus network
– Novel, cost-efficient liquid cooling system
– High packaging density
– Very power-efficient architecture

O(20-30%) sustained performance for key LQCD kernels is reached / feasible

→ O(10-16) TFLOPS / rack (SP)


Chapter 4: Review and Summary

Pages 34-45: Chapter 5: Unforgettable Impressions ;-) (photo slides)

Page 46: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Thank you very much for your attention.

Page 47: QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


Disclaimer

IBM®, DB2®, MVS/ESA, AIX®, S/390®, AS/400®, OS/390®, OS/400®, iSeries, pSeries, xSeries, zSeries, z/OS, AFP, Intelligent Miner, WebSphere®, Netfinity®, Tivoli®, Informix and Informix® Dynamic Server™, IBM, BladeCenter and POWER and others are trademarks of the IBM Corporation in the US and/or other countries.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both, and is used under license therefrom. Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others. The information and materials are provided on an "as is" basis and are subject to change.