Design issues of IBM Cell Architecture
Vitthal Gutthe MEIT 1326, Pravin kumar Yadav MEIT 1338, Vyanktesh Dorlikar MEIT 1324
Page 1: Ibm cell

Design issues of IBM Cell Architecture

Vitthal Gutthe MEIT 1326
Pravin kumar Yadav MEIT 1338
Vyanktesh Dorlikar MEIT 1324

Page 2: Ibm cell

Contents

General introduction
History of development
Technical overview of architecture
Detailed technical discussion of components
Design choices
Cell programming issues

Page 3: Ibm cell

History of Development

Sony Playstation2
• Released March 2000 in Japan
• 128-bit “Emotion Engine”
• MIPS CPU clocked at 294 MHz
• Capable of 6.2 GFLOPS (giga floating-point operations per second)

Page 4: Ibm cell

History Continued

Partnership between Sony, Toshiba, and IBM in summer of 2000
Initial goal of 1000 x PS2 power in a single machine
March 2001: Sony-IBM-Toshiba design center opened with an investment of $400m

Page 5: Ibm cell

Overall Goals for Cell

High performance in multimedia apps
Real-time performance
Minimal power consumption
Cost as low as possible
Available by 2005
Avoid memory latency issues associated with control structures

Page 6: Ibm cell

The Cell itself

PowerPC-based main core (PPE)
Multiple SPEs (Synergistic)
On-die memory controller
Inter-core transport bus
High-speed IO

Page 7: Ibm cell

Cell Die Layout

Page 8: Ibm cell

Cell Implementation

Cell is an architecture
Preliminary implementation
• 1 PPE
• 8 SPEs on the die (1 disabled for yield increase on the PS3, leaving 7)
• 221 mm² die size on a 90 nm process
• Clocked at 3-4 GHz
• 256 GFLOPS single precision @ 4 GHz
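
The 256 GFLOPS figure can be checked with simple arithmetic. The sketch below assumes that all eight SPEs on the die contribute and that each one retires 8 single-precision operations per cycle (the per-SPE figure given on the SPE Execution slide); the function name is illustrative only.

```python
def peak_sp_gflops(num_spes: int, ops_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical single-precision peak: SPEs x operations per cycle x clock (GHz)."""
    return num_spes * ops_per_cycle * clock_ghz


# Assumption: 8 SPEs active, 8 single-precision operations per cycle each, 4 GHz clock.
full_die = peak_sp_gflops(8, 8, 4.0)
# Same estimate with the 7 SPEs available on the PS3.
ps3_config = peak_sp_gflops(7, 8, 4.0)
print(full_die, ps3_config)
```

This is a theoretical peak only; it ignores the PPE and any stalls caused by DMA or branches.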

Page 9: Ibm cell

Why a Cell Architecture

Follows a trend in computing architecture
Natural extension of dual and multi-core
Extremely low hardware overhead
Software controllable
Specialized hardware more useful for multimedia

Page 10: Ibm cell

Possible Uses

Playstation3 (obviously)
Blade servers (IBM)
• Amazing single precision FP performance
• Scientific applications
Toshiba HDTV products

Page 11: Ibm cell

Power Processing Element

PowerPC instruction set with AltiVec
Used for general purpose computing and controlling SPEs
Simultaneous multithreading
Separate 32 KB L1 caches and unified 512 KB L2 cache

Page 12: Ibm cell

PPE (cont.)

Slow but power efficient PowerPC instruction set implementation
Two-issue, in-order instruction fetch
Conspicuous lack of instruction window
Compare to conventional PowerPC implementations (G5)
Performance depends on SPE utilization

Page 13: Ibm cell

Synergistic Processing Element (SPE)

Specialized hardware
Meant to be used in parallel
• (7 on PS3 implementation)
On-chip memory (256 KB)
No branch prediction
In-order execution
Dual issue

Page 14: Ibm cell

SPE Architecture

0.99 µm² on 90 nm process
128 registers (128 bits wide)
• Instructions assumed to be 4 x 32-bit
Variant of VMX instruction set
• Modified for 128 registers
On-chip memory is NOT a cache

Page 15: Ibm cell

SPE Execution

Dual issue, in-order
Seven execution units
Vector logic
8 single precision operations per cycle
Significant performance hit for double precision

Page 16: Ibm cell

SPE Execution Diagram

Page 17: Ibm cell

SPE Local Storage Area

NOT a cache
256 KB, 4 x 64 KB ECC single-port SRAM
Completely private to each SPE
Directly addressable by software
Can be used as a cache, but only with software controls
No tag bits, or any extra hardware

Page 18: Ibm cell

SPE LS Scheduling

Software controlled DMA
DMA to and from main memory
Scheduling a HUGE problem
• Done primarily in software
• IBM predicts 80-90% usage ideally
Request queue handles 16 simultaneous requests
• Up to 16 KB transfer each
• Priority: DMA, L/S, Fetch
Fetch / execute parallelism
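
The two queue limits above (16 outstanding requests, 16 KB per request) determine how software has to carve up a large transfer. The sketch below is a hypothetical planner, not a real DMA API: it splits a transfer into requests of at most 16 KB and groups them into batches that fit the queue. Alignment rules and the actual issue mechanism are deliberately ignored.

```python
MAX_TRANSFER_BYTES = 16 * 1024  # per-request limit from the slide
QUEUE_DEPTH = 16                # simultaneous requests from the slide


def plan_dma(total_bytes: int):
    """Return batches of (offset, size) requests that respect both limits."""
    requests = []
    offset = 0
    while offset < total_bytes:
        size = min(MAX_TRANSFER_BYTES, total_bytes - offset)
        requests.append((offset, size))
        offset += size
    # Group the requests so that no batch exceeds the queue depth.
    return [requests[i:i + QUEUE_DEPTH] for i in range(0, len(requests), QUEUE_DEPTH)]


batches = plan_dma(300 * 1024)
print(len(batches), [len(batch) for batch in batches])
```

A 300 KB transfer needs 19 requests, so it cannot be queued in one go; the software has to issue a second batch once slots free up, which is part of why scheduling is described as a huge problem.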

Page 19: Ibm cell

SPE Control Logic

Very little in comparison
Represents shift in focus
Complete lack of branch prediction
• Software branch prediction
• Loop unrolling
• 18 cycle penalty
Software controlled DMA
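
The link between the 18-cycle penalty and loop unrolling can be shown with a rough cost estimate. This sketch assumes, purely for illustration, that every loop back-edge branch pays the full penalty (the worst case, with no software branch hint); the function name is hypothetical.

```python
import math

BRANCH_PENALTY_CYCLES = 18  # penalty quoted on the slide


def loop_branch_overhead(iterations: int, unroll_factor: int) -> int:
    """Worst-case cycles lost to back-edge branches for a given unroll factor."""
    # Unrolling by N puts N loop bodies between consecutive branches.
    branches = math.ceil(iterations / unroll_factor)
    return branches * BRANCH_PENALTY_CYCLES


for factor in (1, 4, 8):
    print(factor, loop_branch_overhead(1000, factor))
```

Under this assumption, unrolling a 1000-iteration loop four times cuts the branch overhead from 18000 cycles to 4500, at the cost of larger code in the 256 KB local store.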

Page 20: Ibm cell

SPE Pipeline

Little ILP, and thus little control logic
Dual issue
Simple commit unit (no reorder buffer or other complexities)
Same execution unit for FP/int

Page 21: Ibm cell

SPE Summary

Essentially small vector computer
Based on AltiVec/VMX ISA
• Extensions for DMA and LS management
• Extended for 128 x 128-bit register file
Uniquely suited for real time applications
Extremely fast for certain FP operations
Offload a large amount on to compiler / software

Page 22: Ibm cell

Element Interconnect Bus

4 concentric rings connecting all Cell elements
128-bit wide interconnects

Page 23: Ibm cell

EIB (cont.)

Designed to minimize coupling noise
Rings of data traveling in alternating directions
Buffers and repeaters at each SPE boundary
Architecture can be scaled up with increased bus latency
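
Because the rings run in alternating directions, a transfer can take whichever direction is shorter, and latency grows with the number of stops on the ring. The sketch below illustrates that idea on a hypothetical ring whose stops are numbered 0 to num_stops - 1; the stop count and the selection rule are assumptions for illustration, not the EIB controller's actual arbitration.

```python
def ring_hops(src: int, dst: int, num_stops: int):
    """Pick the ring direction with fewer hops between two stops."""
    clockwise = (dst - src) % num_stops
    counter_clockwise = (src - dst) % num_stops
    if clockwise <= counter_clockwise:
        return ("clockwise", clockwise)
    return ("counter-clockwise", counter_clockwise)


# Assumed example: a ring with 12 stops.
print(ring_hops(0, 3, 12))
print(ring_hops(0, 10, 12))
```

With two directions available the worst case is half the ring, so adding more SPEs lengthens that worst-case path, which matches the note about scaling at the cost of bus latency.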

Page 24: Ibm cell

EIB (cont.)

Total bandwidth at ~200 GB/s
EIB controller located physically in center of chip between SPEs
Controller reserves channels for each individual data transfer request
Implementation allows for SPE extension horizontally

Page 25: Ibm cell

Memory Interface

Rambus XDR memory to keep Cell at full utilization
3.2 Gbps data rate per pin on each device connected to the XDR interface
Cell uses dual channel XDR with four devices and 16-bit wide buses to achieve 25.6 GB/s total memory bandwidth
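
The total memory bandwidth follows from the per-device figures. This sketch treats the 3.2 Gbps number as a per-pin data rate (an assumption about how the slide's figure should be read) and multiplies it out across a 16-bit bus and four devices; the function name is illustrative.

```python
def xdr_bandwidth_gbytes(gbps_per_pin: float, bus_width_bits: int, devices: int) -> float:
    """Aggregate bandwidth in GB/s: per-pin rate x pins per device x devices / 8 bits."""
    return gbps_per_pin * bus_width_bits * devices / 8


# 3.2 Gbps per pin, 16-bit bus per device, four devices.
print(xdr_bandwidth_gbytes(3.2, 16, 4))
```

Under that reading the arithmetic gives 25.6 GB/s (3.2 x 16 x 4 = 204.8 Gbps, divided by 8).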

Page 26: Ibm cell

Input / Output Bus

Rambus FlexIO bus
IO interface consists of 12 unidirectional byte lanes
Each lane supports 6.4 GB/s bandwidth
7 outbound lanes and 5 inbound lanes
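
The lane counts and per-lane rate above imply the aggregate I/O bandwidth. A minimal sketch of that arithmetic, using only the figures on this slide (the function and constant names are illustrative):

```python
LANE_GBYTES_PER_S = 6.4  # per-lane bandwidth from the slide
OUTBOUND_LANES = 7
INBOUND_LANES = 5


def flexio_bandwidth():
    """Return (outbound, inbound, total) bandwidth in GB/s."""
    outbound = OUTBOUND_LANES * LANE_GBYTES_PER_S
    inbound = INBOUND_LANES * LANE_GBYTES_PER_S
    return outbound, inbound, outbound + inbound


outbound, inbound, total = flexio_bandwidth()
print(round(outbound, 1), round(inbound, 1), round(total, 1))
```

That works out to roughly 44.8 GB/s outbound and 32 GB/s inbound, or about 76.8 GB/s in total, as a theoretical peak.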

Page 27: Ibm cell

Design Choices

In-order execution
• Abandoning ILP
• ILP: 10-20% increase per generation
• Reducing control logic
• Real time responsiveness
Cache Design
• Software configuration on SPE
• Standard L2 cache on PPE

Page 28: Ibm cell

Cell Programming Issues

No Cell compiler in existence to manage utilization of SPEs at compile time
SPEs do not natively support context switching. Must be OS managed.
SPEs are vector processors. Not efficient for general-purpose computation.
PPEs and SPEs use different instruction sets.

Page 29: Ibm cell

Cell Programming (cont.)

Functional Offload Model
Simplest model for Cell programming
Optimize existing libraries for SPE computation
Requires no rebuild of main application logic which runs on PPE
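
The functional offload model can be sketched as a library function that keeps its signature while moving the work elsewhere. In the hypothetical example below, a thread pool stands in for the SPEs and `scale` stands in for an existing library routine; none of these names come from a real Cell SDK. The caller (the "PPE" application logic) does not change.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 7  # stand-in for the SPEs; 7 matches the PS3 figure in the slides


def _scale_chunk(chunk, factor):
    """Kernel that would be rewritten for the SPEs in a real port."""
    return [x * factor for x in chunk]


def scale(values, factor):
    """Library entry point: unchanged for the caller, parallel work inside."""
    if not values:
        return []
    chunk_size = -(-len(values) // NUM_WORKERS)  # ceiling division
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        parts = list(pool.map(_scale_chunk, chunks, [factor] * len(chunks)))
    # Reassemble the results in their original order.
    return [x for part in parts for x in part]


# Application logic calls the library exactly as before.
print(scale(list(range(10)), 2))
```

Only the library internals are touched, which is why this model needs no rebuild of the main application logic.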
