cellproc(3)



Cell Processor

Systems Seminar

Diana Palsetia (11/21/2006)



Background

Joint collaboration of IBM/Sony/Toshiba

Develop a new/next-gen processor

Initially for PlayStation 3

Also for other multimedia applications (Blu-ray, HDTV)

Server systems



Objective

Outstanding performance

Overcome the memory wall

Improve power efficiency

Sustain high frequency without an increase in pipeline depth

Real-time response to user

Visual, sound & other sensory feedback

Connect to the Internet (able to handle a variety of workloads)

Applicable to a wide range of platforms

Next-generation consumer electronic systems & beyond



Synergistic Processing Element



Power Processor Element (PPE)

The PPE is a 64-bit "Power Architecture" processor capable of running POWER or PowerPC binaries

Extended Vector Scalar Unit (VSU)

The PPE is

In-order

Dual threaded

Dual Issue



PPE components

Copyright: IBM



Synergistic Processing Elements

An SPE is a self-contained vector (SIMD) processor which acts as a co-processor

The SPE's ISA is a cross between VMX and the PS2's Emotion Engine

In-order (again to minimize circuitry to save power)

Statically scheduled (the compiler plays a big role)

No dynamic branch prediction hardware (relies on compiler-generated hints)

Each SPE consists of:

128 x 128-bit register file

Local Store (SRAM)

DMA unit

FP, LD/ST, Permute and Branch units (each pipelined)



SPE Architecture

Copyright: IBM



SPE Local Store

Each SPE has local on-chip memory, a.k.a. the Local Store (LS), which serves as a secondary register file (not as a cache)

Avoids the coherence logic needed by caches as well as the cache-miss penalty

Is mapped into the memory map of the processor, allowing LS-to-LS transactions

128-bit instruction fetch, load and store operations

7 out of every 8 cycles

Data/instructions are transferred between the LS and system memory or other SPEs' LS using the DMA unit

128 bytes at a time (transfer rate of 0.5 terabytes/sec)

DMA transactions are coherent
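
The 128-byte transfer granularity above can be illustrated with a small model. This is only a sketch: `split_into_dma_blocks` is a hypothetical helper, not part of any Cell SDK, and it assumes each transfer must stay within one 128-byte-aligned block. Real MFC alignment and size rules are more detailed.

```python
DMA_BLOCK_BYTES = 128  # transfer granularity stated on the slide


def split_into_dma_blocks(start_addr, num_bytes, block=DMA_BLOCK_BYTES):
    """Return (address, size) pairs covering [start_addr, start_addr + num_bytes).

    Hypothetical helper: each pair models one DMA transfer that never
    crosses a 128-byte boundary (an assumption made for illustration).
    """
    transfers = []
    addr = start_addr
    end = start_addr + num_bytes
    while addr < end:
        # Stop at the next 128-byte boundary or at the end of the buffer.
        next_boundary = (addr // block + 1) * block
        size = min(next_boundary, end) - addr
        transfers.append((addr, size))
        addr += size
    return transfers


# A 1000-byte buffer starting at an aligned address.
blocks = split_into_dma_blocks(start_addr=0, num_bytes=1000)
print(len(blocks), blocks[-1])
```

An aligned buffer needs the fewest transfers. An unaligned start address adds a short leading transfer, which is one reason to align buffers in the LS and in system memory.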



SPE DMA Unit

Contains the Memory Flow Controller (MFC)

Interface uses the Power Architecture page protection model

MFC has its own Memory Management Unit (MMU) that is a subset of the Power core's MMU

This allows a consistent interface to the system storage map for all processors despite the chip's heterogeneous structure



Floating Point Performance

Both the PPE and the SPEs have vector instruction capability

In particular, each SPU can complete

2 double-precision operations per clock cycle, which translates to 6.4 GFLOPS at 3.2 GHz

OR

8 single-precision operations per clock cycle, which translates to 25.6 GFLOPS at 3.2 GHz
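
The peak figures above follow from operations per cycle times clock frequency. A minimal sketch of that arithmetic, where `peak_gflops` is a hypothetical helper name. The 8-SPE aggregate is derived here for illustration only, ignores the PPE, and is a theoretical peak rather than a measured rate.

```python
CLOCK_GHZ = 3.2  # clock frequency quoted on the slide


def peak_gflops(ops_per_cycle, clock_ghz=CLOCK_GHZ):
    """Peak rate in GFLOPS = floating-point operations per cycle x clock in GHz."""
    return ops_per_cycle * clock_ghz


double_precision = peak_gflops(2)  # 2 DP operations per cycle per SPU
single_precision = peak_gflops(8)  # 8 SP operations per cycle per SPU

# Illustrative aggregate over 8 SPEs (assumes all SPEs run at peak).
all_spes_single = 8 * single_precision

print(f"{double_precision:.1f} {single_precision:.1f} {all_spes_single:.1f}")
```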



Element Interconnect Bus

Connects various on-chip elements

PPE, 8 SPEs, memory controller (MIC) & off-chip I/O interfaces

Data-ring structure with control of a bus

4 unidirectional rings, with 2 rings running in the opposite direction to the other 2

Worst-case latency is only half the distance around the ring

Each ring is 16 bytes wide and runs at half the core clock frequency (core clock freq ~3.2 GHz)
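
Why counter-rotating rings halve the worst-case distance, and what the ring width implies, can be sketched as follows. The element count of 12 is an assumption (PPE, 8 SPEs, MIC and two I/O interfaces), `ring_hops` is a hypothetical helper, and the bandwidth number is a raw per-ring figure, not a sustained rate.

```python
def ring_hops(src, dst, n):
    """Hops between two ring positions when data may travel in either direction."""
    clockwise = (dst - src) % n
    counter_clockwise = (src - dst) % n
    return min(clockwise, counter_clockwise)


N_ELEMENTS = 12  # assumption: PPE + 8 SPEs + MIC + 2 I/O interfaces

# Worst case over all destinations, seen from element 0.
worst_case_hops = max(ring_hops(0, dst, N_ELEMENTS) for dst in range(N_ELEMENTS))

RING_WIDTH_BYTES = 16
CORE_CLOCK_GHZ = 3.2
# Each ring runs at half the core clock, so bytes per ns equals GB/s.
ring_gb_per_s = RING_WIDTH_BYTES * (CORE_CLOCK_GHZ / 2)

print(worst_case_hops, ring_gb_per_s)
```

With rings in one direction only, the worst case would be 11 hops instead of 6.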



Memory and I/O

Cell needs a tremendous amount of memory and I/O bandwidth

Memory Technology: Rambus XDR DRAM

Supports total bandwidth of 25.6 GB/s

I/O: Rambus FlexIO



Programming the Cell is challenging

Issues

Dividing program among different cores

Creating instructions in a different language for the 8 SPEs than for the PowerPC core

Need to think in terms of the SIMD nature of dataflow to get maximum performance from the SPUs

The SPU needs to perform coherent DMA accesses to move data between its local store and system memory



IBM Approach

Manually partition the application into separate code segments and use the compiler that targets the appropriate ISA

For SPUs, SIMD code generation can be done by a parallelizing compiler with auto-SIMDization

Allocate SPE program data in system memory (shared memory view) and have the SPE compiler automatically manage the movement of data

A naive compiler inserts an explicit DMA transfer for each access to shared memory

Optimized: employ a software cache mechanism that permits reuse of the temporary buffers in the LS
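
The difference between the naive and the software-cache strategies can be modelled by counting DMA transfers. This is a simplified sketch, not IBM's compiler runtime: `SystemMemory`, `SoftwareCache`, the 4-line capacity and the FIFO eviction policy are all assumptions, and the cache is read-only (no write-back).

```python
LINE_BYTES = 128  # one DMA transfer moves one 128-byte line


class SystemMemory:
    """Hypothetical stand-in for shared system memory that counts DMA reads."""

    def __init__(self, contents):
        self.data = bytes(contents)
        self.dma_transfers = 0

    def dma_get(self, line_addr):
        self.dma_transfers += 1
        return self.data[line_addr:line_addr + LINE_BYTES]


class SoftwareCache:
    """Read-only software cache holding a few lines in LS buffers (FIFO eviction)."""

    def __init__(self, memory, num_lines=4):
        self.memory = memory
        self.num_lines = num_lines
        self.lines = {}  # line address -> bytes; dicts keep insertion order

    def load_byte(self, addr):
        line_addr = addr - addr % LINE_BYTES
        if line_addr not in self.lines:
            if len(self.lines) >= self.num_lines:
                del self.lines[next(iter(self.lines))]  # evict the oldest line
            self.lines[line_addr] = self.memory.dma_get(line_addr)
        return self.lines[line_addr][addr - line_addr]


def naive_sum(memory, n):
    """One explicit DMA transfer per access, as a naive compiler would emit."""
    total = 0
    for addr in range(n):
        line_addr = addr - addr % LINE_BYTES
        total += memory.dma_get(line_addr)[addr - line_addr]
    return total


def cached_sum(memory, n):
    """The same loop, but accesses go through the software cache."""
    cache = SoftwareCache(memory)
    return sum(cache.load_byte(addr) for addr in range(n))


contents = [i % 256 for i in range(512)]
naive_mem, cached_mem = SystemMemory(contents), SystemMemory(contents)
naive_total = naive_sum(naive_mem, 512)
cached_total = cached_sum(cached_mem, 512)
print(naive_mem.dma_transfers, cached_mem.dma_transfers)
```

For a sequential scan the cached version issues one transfer per 128-byte line instead of one per access.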



IBM Approach (contd.)

Using the SPE linker and an embedding tool, generate a PPE executable that contains the SPE binary embedded within the data section

This PPE object is then linked, using a PPE linker, with the runtime libraries required for thread creation and management, to create a bound executable for the Cell BE program



Compiling and Binding of a program on CELL

Copyright: IBM



Programming Models

Stream processing

Serial or parallel pipelines can be set up

Example: a set-top box pipeline consists of reading, video and audio encoding, and display

Serial: chain SPEs so that each SPE does one subtask

Parallel: partition the same subtask among SPEs
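
The two pipeline layouts can be contrasted with a toy model. Everything here is illustrative: the stage functions are arithmetic stand-ins for the real subtasks, `serial_pipeline` and `parallel_partition` are hypothetical names, and the code runs sequentially, so it shows only how work is divided, not real concurrency on SPEs.

```python
def read_stage(frame):
    return frame        # stand-in for reading input


def encode_stage(frame):
    return frame * 2    # stand-in for video/audio encoding


def display_stage(frame):
    return frame + 1    # stand-in for display


def serial_pipeline(frames, stages):
    """Serial layout: each stage models one SPE; every frame visits all stages in order."""
    results = []
    for frame in frames:
        for stage in stages:
            frame = stage(frame)
        results.append(frame)
    return results


def parallel_partition(frames, stage, num_workers):
    """Parallel layout: the same subtask is split round-robin among workers (SPEs)."""
    chunks = [frames[w::num_workers] for w in range(num_workers)]
    chunk_results = [[stage(f) for f in chunk] for chunk in chunks]
    merged = [None] * len(frames)
    for w, results in enumerate(chunk_results):
        merged[w::num_workers] = results  # restore the original frame order
    return merged


frames = [0, 1, 2, 3, 4, 5]
serial_out = serial_pipeline(frames, [read_stage, encode_stage, display_stage])
parallel_out = parallel_partition(frames, encode_stage, num_workers=3)
print(serial_out, parallel_out)
```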



Programming Model

Function Offload Model

Application executes on the PPE

Complex library functions invoked by the main application are offloaded onto one or more SPEs

Library function(s) are optimized and recompiled for the SPE environment

SPE executable program is linked into the PPE object module as a small remote function invocation stub
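
The stub idea can be sketched by analogy: the main application calls what looks like a local function, and the stub forwards the call to a worker. This is not the Cell SDK interface. A thread pool stands in for the SPEs, and `OffloadStub` and `dot_product_kernel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SPES = 8  # number of SPEs on the chip; modelled here as worker threads


def dot_product_kernel(a, b):
    """Stand-in for a library function recompiled for the SPE environment."""
    return sum(x * y for x, y in zip(a, b))


class OffloadStub:
    """Small remote-invocation stub: same call signature, runs elsewhere."""

    def __init__(self, kernel, pool):
        self.kernel = kernel
        self.pool = pool

    def __call__(self, *args):
        # Ship the arguments to a worker and wait for the result,
        # so the caller sees an ordinary function call.
        future = self.pool.submit(self.kernel, *args)
        return future.result()


# "Main application" side: it only ever sees the stub.
with ThreadPoolExecutor(max_workers=NUM_SPES) as spe_pool:
    dot_product = OffloadStub(dot_product_kernel, spe_pool)
    result = dot_product([1, 2, 3], [4, 5, 6])

print(result)
```

The design point is that the caller's code does not change when a function is offloaded. Only the stub knows where the work runs.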



Current/Future Applications

Sony PlayStation 3

Significant improvement over PS2

IBM Blade Server

Blade server prototype containing two Cell processors

Ran at 2.4 GHz (current systems run at 3.2 GHz), providing 200 GFLOPS single-precision floating-point performance per CPU

Mercury

Incorporates Cell-based systems into military vehicles

Used for target recognition, tracking geo-location, mapping, video processing, etc.

