8/3/2019 cellproc(3)
Cell Processor
Systems Seminar
Diana Palsetia (11/21/2006)
Background
Joint collaboration of IBM/Sony/Toshiba
Develop a new/next-gen processor
Initially for PlayStation 3
Other multimedia applications (Blu-ray, HDTV)
Server systems
Objective
Outstanding performance
Overcome memory wall
Improve power efficiency
Sustain high frequency without increase in pipeline depth
Real-time response to user
Visual, sound & other sensory feedback
Connect to the internet (able to handle a variety of workloads)
Applicable to a wide range of platforms
Next-generation consumer electronic systems & beyond
Synergistic Processing Element
Power Processor Element (PPE)
The PPE is a 64-bit "Power Architecture" core
capable of running POWER or PowerPC binaries
Extended Vector Scalar Unit (VSU)
The PPE is
In-order
Dual threaded
Dual Issue
PPE components
Copyright: IBM
Synergistic Processing Elements
An SPE is a self-contained vector (SIMD) processor which acts as a co-processor
The SPE ISA is a cross between VMX and the PS2's Emotion Engine
In-order (again to minimize circuitry and save power)
Statically scheduled (compiler plays a big role)
Also no dynamic branch prediction hardware (relies on compiler-generated hints)
Each SPE consists of:
128 x 128-bit register file
Local Store (SRAM)
DMA unit
FP, LD/ST, Permute, Branch Unit (each pipelined)
SPE Architecture
Copyright: IBM
SPE Local Store
Each SPE has local on-chip memory, a.k.a. Local Store (LS), which serves as a secondary register file (not as a cache)
Avoids the coherence logic needed for caches as well as the cache-miss penalty
Is mapped into the memory map of the processor to allow LS-to-LS transactions
128-bit instruction fetch, load and store operations
7 out of every 8 cycles
Data/instructions are transferred between the LS and system memory or another SPE's LS using the DMA unit
128 bytes at a time (transfer rate of 0.5 terabytes/sec)
DMA transactions are coherent
SPE DMA Unit
Contains the Memory Flow Controller(MFC)
Interface uses Power Architecture page protection
model
MFC has its own Memory Management Unit (MMU) that is a subset of the Power core's MMU
This allows a consistent interface to the system storage map
for all processors despite its heterogeneous structure
Floating Point Performance
Both PPE and SPE have vector instruction capability
In particular, each SPU can complete
2 double-precision operations per clock cycle, which translates to 6.4 GFLOPS at 3.2 GHz
OR
8 single-precision operations per clock cycle, which translates to 25.6 GFLOPS at 3.2 GHz
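The peak figures on this slide are simply operations per cycle multiplied by the clock rate. A minimal sketch of that arithmetic follows; the function name `peak_gflops` is made up for illustration, and the numbers are theoretical peaks taken from the slide, not measured results.

```python
def peak_gflops(ops_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical peak in GFLOPS: operations per cycle times billions of cycles per second."""
    return ops_per_cycle * clock_ghz


CLOCK_GHZ = 3.2  # clock rate quoted on the slide

dp = peak_gflops(2, CLOCK_GHZ)  # double precision, 2 ops per cycle
sp = peak_gflops(8, CLOCK_GHZ)  # single precision, 8 ops per cycle

print(f"double precision: {dp:.1f} GFLOPS")
print(f"single precision: {sp:.1f} GFLOPS")
```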
Element Interconnect Bus
Connects various on chip elements
PPE, 8 SPEs, memory controller (MIC) & off-chip I/O interfaces
Data-ring structure with the control of a bus
4 unidirectional rings, with 2 rings running counter to the other 2
Worst-case maximum latency is only half the distance of the ring
Each ring is 16 bytes wide and runs at half the core clock frequency (core clock freq ~3.2 GHz)
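The "half the distance of the ring" claim follows from having rings in both directions: a transfer can always take the shorter way around. The sketch below illustrates this, along with a naive width-times-clock bandwidth figure per ring. The stop count of 12 is an assumption made for the example (the slide does not give one), and the bandwidth formula ignores arbitration and concurrent transfers.

```python
def ring_hops(src: int, dst: int, n_stops: int) -> int:
    """Fewest hops between two stops when data may travel in either direction."""
    clockwise = (dst - src) % n_stops
    return min(clockwise, n_stops - clockwise)


def worst_case_hops(n_stops: int) -> int:
    """Largest shortest-path distance from stop 0 to any other stop."""
    return max(ring_hops(0, dst, n_stops) for dst in range(n_stops))


def ring_bandwidth_gb_s(width_bytes: int, core_clock_ghz: float) -> float:
    """Naive peak per ring: bus width times ring clock (half the core clock)."""
    return width_bytes * core_clock_ghz / 2


N_STOPS = 12  # assumed number of ring stops, for illustration only

print(worst_case_hops(N_STOPS))       # half of N_STOPS
print(ring_bandwidth_gb_s(16, 3.2))   # GB/s for one ring
```

With only one direction available, the worst case would be `N_STOPS - 1` hops instead of `N_STOPS // 2`.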
Memory and I/O
Cell needs a tremendous amount of memory and I/O bandwidth
Memory Technology: Rambus XDR DRAM
Supports total bandwidth of 25.6 GB/s
I/O: Rambus FlexI/O
Programming the Cell is challenging
Issues
Dividing program among different cores
Creating instructions in a different language for the 8 SPEs than for the PowerPC core
Need to think in terms of the SIMD nature of the dataflow to get maximum performance from SPUs
SPU local store needs to perform coherent DMA accesses to reach system memory
IBM Approach
Manually partition the application into separate code
segments and use the compiler that targets the appropriate
ISA
For SPUs, SIMD code generation can be done by
a parallelizing compiler with auto-SIMDization
Allocate SPE program data in system memory (shared
memory view) & have the SPE compiler automatically
manage the movement of data
A naive compiler inserts an explicit DMA transfer for each access to shared memory
Optimized: employ a software cache mechanism that
permits reuse of the temporary buffers in the LS
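The difference between the naive and the optimized strategy can be illustrated with a small simulation. This is a sketch under stated assumptions, not IBM's compiler mechanism: `SystemMemory`, `naive_read`, and `SoftwareCache` are hypothetical names, the cache is direct-mapped and read-only, and a "DMA transfer" is just a counted 128-byte copy.

```python
LINE_BYTES = 128  # DMA transfer size quoted on the Local Store slide


class SystemMemory:
    """Stand-in for system memory that counts simulated DMA transfers."""

    def __init__(self, size: int):
        self.data = bytearray(size)
        self.dma_transfers = 0

    def dma_get(self, line_index: int) -> bytes:
        """Copy one 128-byte line out of memory and count the transfer."""
        self.dma_transfers += 1
        start = line_index * LINE_BYTES
        return bytes(self.data[start:start + LINE_BYTES])


def naive_read(mem: SystemMemory, addr: int) -> int:
    """Naive strategy: one DMA transfer for every single access."""
    line = mem.dma_get(addr // LINE_BYTES)
    return line[addr % LINE_BYTES]


class SoftwareCache:
    """Direct-mapped cache of lines held in (simulated) local-store buffers."""

    def __init__(self, mem: SystemMemory, n_slots: int = 4):
        self.mem = mem
        self.n_slots = n_slots
        self.tags = [None] * n_slots     # which line each buffer holds
        self.buffers = [None] * n_slots  # the temporary buffers themselves

    def read(self, addr: int) -> int:
        line_index = addr // LINE_BYTES
        slot = line_index % self.n_slots
        if self.tags[slot] != line_index:  # miss: fetch the line once
            self.buffers[slot] = self.mem.dma_get(line_index)
            self.tags[slot] = line_index
        return self.buffers[slot][addr % LINE_BYTES]  # hit: reuse the buffer


naive_mem = SystemMemory(1024)
for addr in range(512):
    naive_read(naive_mem, addr)

cached_mem = SystemMemory(1024)
cache = SoftwareCache(cached_mem)
for addr in range(512):
    cache.read(addr)

print(naive_mem.dma_transfers, cached_mem.dma_transfers)
```

For 512 sequential byte reads the naive path issues one transfer per read, while the cached path issues one transfer per 128-byte line. Writes and eviction policy are left out to keep the sketch short.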
IBM Approach (contd.)
Using the SPE linker and an embedding tool,
generate a PPE executable that contains the SPE binary embedded within the data section
The PPE object is then linked, using a PPE linker,
with the runtime libraries which are required for thread creation and management, to create a bound executable
for the Cell BE program
Compiling and Binding of a program on CELL
Copyright: IBM
Programming Models
Stream processing
Serial or parallel pipelines can be set up
Example: a set-top box pipeline consists of reading, video and audio
encoding, and display.
Serial: chaining SPEs, where each SPE does one subtask
Parallel: partition the same subtask among SPEs
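The two pipeline shapes can be sketched in ordinary Python, with threads standing in for SPEs. Everything here is illustrative: `decode` and `enhance` are hypothetical stages operating on integers rather than real media frames, and a thread pool is only a loose analogy for SPE scheduling.

```python
from concurrent.futures import ThreadPoolExecutor


def decode(frame: int) -> int:
    """Hypothetical first stage."""
    return frame * 2


def enhance(frame: int) -> int:
    """Hypothetical second stage."""
    return frame + 1


def serial_pipeline(frames, stages):
    """Serial model: each stage plays the role of one SPE in a chain."""
    results = []
    for frame in frames:
        for stage in stages:
            frame = stage(frame)  # output of one stage feeds the next
        results.append(frame)
    return results


def parallel_partition(frames, task, n_workers):
    """Parallel model: the same subtask is split across several workers."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(task, frames))  # map preserves input order


print(serial_pipeline([1, 2, 3], [decode, enhance]))
print(parallel_partition([1, 2, 3], decode, n_workers=2))
```

The serial form suits workloads with distinct stages; the parallel form suits one heavy stage whose input can be split.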
Programming Model
Function Offload Model
Application executes on PPE
Complex library functions invoked by the main
application are offloaded onto one or more SPEs
Library function(s) are optimized and recompiled for
the SPE environment
SPE executable program is linked into the PPE object
module as a small remote function invocation stub
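The stub idea can be mimicked in a few lines: the main application calls what looks like a normal function, and the stub forwards the call to a worker. This is an analogy only, not the Cell runtime: the `offload` decorator and the `dot` function are invented for the example, and a thread pool with 8 workers stands in for the SPEs.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps

# Eight workers stand in for the eight SPEs (an assumption of this sketch).
_spe_pool = ThreadPoolExecutor(max_workers=8)


def offload(func):
    """Replace func with a stub that forwards each call to the worker pool."""
    @wraps(func)
    def stub(*args, **kwargs) -> Future:
        # The caller gets a Future back immediately and can keep working.
        return _spe_pool.submit(func, *args, **kwargs)
    return stub


@offload
def dot(a, b):
    """Hypothetical library function chosen for offloading."""
    return sum(x * y for x, y in zip(a, b))


future = dot([1, 2, 3], [4, 5, 6])  # returns at once with a Future
print(future.result())              # blocks until the worker finishes
```

The main application's call site stays the same when a function is moved to a worker; only the stub knows where the work actually runs.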
Current/Future Applications
Sony PlayStation 3
Significant improvement over PS2
IBM Blade Server
Blade server prototype containing two Cell processors
Ran at 2.4 GHz (current systems run at 3.2 GHz), providing 200 GFLOPS of single-precision floating-point performance per CPU
Mercury
Incorporate Cell-based systems into military vehicles
Used for target recognition, tracking, geo-location, mapping, video processing, etc.