cellproc(3)

Apr 06, 2018

Arun Mathew
Transcript
  • 8/3/2019 cellproc(3)

    1/21

    Cell Processor

    Systems Seminar

    Diana Palsetia (11/21/2006)

  • 8/3/2019 cellproc(3)

    2/21

    Background

    Joint collaboration of IBM/Sony/Toshiba

    Develop a new/next-gen processor

    Initially for PlayStation 3

    Other multimedia applications (Blu-ray, HDTV)

    Server systems

  • 8/3/2019 cellproc(3)

    3/21

    Objective

    Outstanding performance

    Overcome memory wall

    Improve power efficiency

    Sustain high frequency without increase in pipeline depth

    Real-time response to user

    Visual, sound & other sensory feedback

    Connect to the Internet (able to handle a variety of workloads)

    Applicable to a wide range of platforms

    Next-generation consumer electronic systems & beyond

  • 8/3/2019 cellproc(3)

    4/21

    Synergistic Processing Element

  • 8/3/2019 cellproc(3)

    5/21

    Power Processor Element (PPE)

    The PPE is a 64-bit "Power Architecture" core capable of running POWER or PowerPC binaries

    Extended Vector Scalar Unit (VSU)

    The PPE is:

    In-order

    Dual-threaded

    Dual-issue

  • 8/3/2019 cellproc(3)

    6/21

    PPE components

    Copyright: IBM

  • 8/3/2019 cellproc(3)

    7/21

    Synergistic Processing Elements

    An SPE is a self-contained vector (SIMD) processor which acts as a co-processor

    The SPE's ISA is a cross between VMX and the PS2's Emotion Engine

    In-order (again, to minimize circuitry and save power)

    Statically scheduled (compiler plays a big role)

    Also no dynamic prediction hardware (relies on compiler-generated hints)

    Each SPE consists of:

    128 x 128-bit register file

    Local Store (SRAM)

    DMA unit

    FP, LD/ST, Permute, Branch Unit (each pipelined)

  • 8/3/2019 cellproc(3)

    8/21

    SPE Architecture

    Copyright: IBM

  • 8/3/2019 cellproc(3)

    9/21

    SPE Local Store

    Each SPE has local on-chip memory, a.k.a. Local Store (LS), which serves as a secondary register file (not as a cache)

    Avoids the coherence logic that caches need, as well as the cache-miss penalty

    Is mapped into the memory map of the processor, allowing LS-to-LS transactions

    128-bit instruction fetch, load and store operations

    7 out of every 8 cycles

    Data/instructions are transferred between the LS and system memory/other SPEs' LS using the DMA unit

    128 bytes at a time (transfer rate of 0.5 terabytes/sec)

    DMA transactions are coherent
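
The 128-byte transfer granularity can be illustrated with a small sketch. The function name `split_into_dma_blocks` and the addresses used are hypothetical, chosen only for illustration; on real hardware the transfers are issued through the DMA unit, not from Python.

```python
BLOCK_BYTES = 128  # transfer granularity stated on the slide


def split_into_dma_blocks(start, size, block=BLOCK_BYTES):
    """Split the byte range [start, start + size) into pieces that never
    cross a block boundary. Returns a list of (address, length) pairs."""
    pieces = []
    addr = start
    end = start + size
    while addr < end:
        # First block boundary strictly above the current address.
        boundary = (addr // block + 1) * block
        stop = min(boundary, end)
        pieces.append((addr, stop - addr))
        addr = stop
    return pieces
```

An aligned 300-byte request becomes two full blocks plus a 44-byte tail, while an unaligned request is first cut at the next 128-byte boundary.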

  • 8/3/2019 cellproc(3)

    10/21

    SPE DMA Unit

    Contains the Memory Flow Controller (MFC)

    Interface uses the Power Architecture page protection model

    MFC has its own Memory Management Unit (MMU) that is a subset of the Power core's MMU

    This allows a consistent interface to the system storage map for all processors despite its heterogeneous structure

  • 8/3/2019 cellproc(3)

    11/21

    Floating Point Performance

    Both PPE and SPE have vector instruction capability

    In particular, each SPU can complete

    2 double-precision operations per clock cycle, which translates to 6.4 GFLOPS at 3.2 GHz

    OR

    8 single-precision operations per clock cycle, which translates to 25.6 GFLOPS at 3.2 GHz
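
The figures on this slide follow from multiplying operations per cycle by the clock rate. A minimal sketch of that arithmetic, where the function name `peak_gflops` is made up for illustration and the result is a theoretical peak, not a measured rate:

```python
def peak_gflops(ops_per_cycle, clock_ghz):
    """Theoretical peak: operations per cycle times cycles per nanosecond."""
    return ops_per_cycle * clock_ghz


CLOCK_GHZ = 3.2  # clock frequency quoted on the slide

double_precision = peak_gflops(2, CLOCK_GHZ)  # 2 ops/cycle per SPU
single_precision = peak_gflops(8, CLOCK_GHZ)  # 8 ops/cycle per SPU

print(f"double precision: {double_precision:.1f} GFLOPS per SPU")
print(f"single precision: {single_precision:.1f} GFLOPS per SPU")
```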

  • 8/3/2019 cellproc(3)

    12/21

    Element Interconnect Bus

    Connects various on-chip elements

    PPE, 8 SPEs, memory controller (MIC) & off-chip I/O interfaces

    Data-ring structure with the control of a bus

    4 unidirectional rings, but 2 rings run counter to the direction of the other 2

    Worst-case maximum latency is only half the distance of the ring

    Each ring is 16 bytes wide and runs at half the core clock frequency (core clock freq ~3.2 GHz)
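
Two claims on this slide can be checked with a short sketch: the per-ring data rate implied by a 16-byte width at half the core clock, and the half-ring worst case that results from having rings in both directions. All function names are hypothetical, and the 12-element ring is an assumption (PPE, 8 SPEs, MIC and two I/O interfaces); the slide does not state the element count.

```python
def ring_bandwidth_gb_s(width_bytes=16, core_clock_ghz=3.2):
    """Bytes moved per second by one ring clocked at half the core frequency."""
    return width_bytes * (core_clock_ghz / 2)


def hops(src, dst, n, clockwise):
    """Hops from src to dst on an n-element ring in one fixed direction."""
    return (dst - src) % n if clockwise else (src - dst) % n


def shortest_hops(src, dst, n):
    """With rings running both ways, a transfer can take the shorter direction."""
    return min(hops(src, dst, n, True), hops(src, dst, n, False))


def worst_case_hops(n):
    """Largest shortest-path distance between any two elements."""
    return max(shortest_hops(0, dst, n) for dst in range(n))


ELEMENTS = 12  # assumed element count, see the note above
print(ring_bandwidth_gb_s(), "GB/s per ring")
print(worst_case_hops(ELEMENTS), "hops worst case on", ELEMENTS, "elements")
```

With a single direction the worst case would be n - 1 hops; allowing either direction caps it at n // 2.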

  • 8/3/2019 cellproc(3)

    13/21

    Memory and I/O

    Cell needs a tremendous amount of memory and I/O bandwidth

    Memory Technology: Rambus XDR DRAM

    Supports total bandwidth of 25.6 GB/s

    I/O: Rambus FlexIO

  • 8/3/2019 cellproc(3)

    14/21

    Programming the Cell is challenging

    Issues

    Dividing program among different cores

    Creating instructions in a different language for the 8 SPEs than for the PowerPC core

    Need to think in terms of the SIMD nature of dataflow to get maximum performance from SPUs

    SPU local store needs to perform coherent DMA access for accessing system memory

  • 8/3/2019 cellproc(3)

    15/21

    IBM Approach

    Manually partition the application into separate code segments and use the compiler that targets the appropriate ISA

    For SPUs, SIMD code generation can be done by a parallelizing compiler with auto-SIMDization

    Allocate SPE program data in system memory (shared memory view) & have the SPE compiler automatically manage the movement of data

    A naive compiler inserts an explicit DMA transfer for each access to shared memory

    Optimized: employ a software cache mechanism that permits reuse of the temporary buffers in the LS
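
The difference between the naive and the optimized scheme can be sketched with a toy model. This is not IBM's compiler-generated code: the class `SoftwareCache`, its direct-mapped layout, the 128-byte line size and the four-line capacity are all assumptions made for illustration, and only reads are modeled.

```python
class SoftwareCache:
    """Read-only, direct-mapped cache of fixed-size lines, standing in for
    temporary buffers kept in an SPE's local store."""

    def __init__(self, system_memory, line_bytes=128, num_lines=4):
        self.mem = system_memory          # bytes object standing in for system memory
        self.line_bytes = line_bytes
        self.num_lines = num_lines
        self.tags = [None] * num_lines    # which memory line each slot holds
        self.lines = [None] * num_lines   # the buffered data itself
        self.dma_transfers = 0            # counts simulated DMA gets

    def _dma_get(self, line_index):
        """Simulated DMA: copy one whole line from system memory."""
        self.dma_transfers += 1
        start = line_index * self.line_bytes
        return self.mem[start:start + self.line_bytes]

    def read(self, addr):
        """Return the byte at addr, fetching its line only on a miss."""
        line_index = addr // self.line_bytes
        slot = line_index % self.num_lines
        if self.tags[slot] != line_index:
            self.lines[slot] = self._dma_get(line_index)
            self.tags[slot] = line_index
        return self.lines[slot][addr % self.line_bytes]


memory = bytes(range(256)) * 4            # 1024 bytes of sample data
cache = SoftwareCache(memory)
total = sum(cache.read(a) for a in range(256))
print("accesses: 256, simulated DMA transfers:", cache.dma_transfers)
```

A naive scheme would issue one transfer per access (256 here); the buffered version touches system memory once per 128-byte line.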

  • 8/3/2019 cellproc(3)

    16/21

    IBM Approach (contd.)

    Using the SPE linker and an embedding tool, generate a PPE executable that contains the SPE binary embedded within the data section

    PPE object is then linked, using a PPE linker, with the runtime libraries required for thread creation and management, to create a bound executable for the Cell BE program

  • 8/3/2019 cellproc(3)

    17/21

    Compiling and Binding of a program on CELL

    Copyright: IBM

  • 8/3/2019 cellproc(3)

    18/21

    Programming Models

    Stream processing

    Serial or parallel pipelines can be set up

    Example: Set-top box consists of reading, video and audio encoding, and display

    Serial: chaining SPEs, where each SPE does one subtask

    Parallel: partition the same subtask among SPEs
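
The two pipeline shapes can be contrasted in a few lines. This is a sequential simulation only: each function call stands in for work that would run on a separate SPE, and the stage functions `decode` and `encode` are placeholder arithmetic, not real media processing.

```python
def decode(frame):
    return frame * 2   # placeholder for a real decoding stage


def encode(frame):
    return frame + 1   # placeholder for a real encoding stage


def serial_pipeline(frames, stages):
    """Serial model: each stage stands in for one SPE, and every frame
    passes through the stages in order."""
    for stage in stages:
        frames = [stage(f) for f in frames]
    return frames


def parallel_partition(frames, stage, workers):
    """Parallel model: every worker runs the same stage on its own
    contiguous slice of the input."""
    chunk = max(1, -(-len(frames) // workers))  # ceiling division, at least 1
    parts = [frames[i:i + chunk] for i in range(0, len(frames), chunk)]
    results = [[stage(f) for f in part] for part in parts]
    return [f for part in results for f in part]


print(serial_pipeline([1, 2, 3], [decode, encode]))
print(parallel_partition([1, 2, 3, 4, 5], decode, workers=2))
```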

  • 8/3/2019 cellproc(3)

    19/21

    Programming Model

    Function Offload Model

    Application executes on PPE

    Complex library functions invoked by the main application are offloaded onto one or more SPEs

    Library function(s) are optimized and recompiled for the SPE environment

    SPE executable program is linked into the PPE object module as a small remote function invocation stub
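
The stub idea can be sketched as follows. A thread pool stands in for the SPEs purely for illustration; the names `_spe_pool`, `_dot_product_kernel` and `dot_product` are hypothetical and do not correspond to any Cell SDK API.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the 8 SPEs; real offload would start SPE threads instead.
_spe_pool = ThreadPoolExecutor(max_workers=8)


def _dot_product_kernel(a, b):
    """The library function that would be optimized and recompiled for the SPE."""
    return sum(x * y for x, y in zip(a, b))


def dot_product(a, b):
    """PPE-side stub: keeps the original signature, forwards the call to a
    worker, and blocks until the result comes back."""
    future = _spe_pool.submit(_dot_product_kernel, a, b)
    return future.result()


# The main application calls the stub exactly like the original function.
result = dot_product([1, 2, 3], [4, 5, 6])
print(result)
```

Keeping the stub's signature identical to the original function is what lets the main application stay unchanged while the work moves elsewhere.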

  • 8/3/2019 cellproc(3)

    20/21

    Current/Future Applications

    Sony PlayStation 3

    Significant improvement over PS2

    IBM Blade Server

    Blade server prototype containing two Cell processors

    Ran at 2.4 GHz (current systems run at 3.2 GHz), providing 200 GFLOPS single-precision floating-point performance per CPU

    Mercury

    Incorporate Cell-based systems into military vehicles

    Used for target recognition, tracking geo-location, mapping, video processing, etc.
