MPSoCs – Hardware platforms
Luca Benini – DEIS Università di Bologna [email protected]
Luca Benini ARTIST2 / UNU IIST 2007
Outline
• Why MPSoCs?
• MPSoC architectures
• Case studies
Architecture Evolution
• Roadmap continues: 90→65→45 nm
• "Traditional" bus-based SoCs fit in one tile!
• Communication demand is staggering, but unevenly distributed, because of architectural heterogeneity
[Figure: tiled SoC with PEs, SRAM, DRAM, I/O and peripherals, plus 3D stacked main memory; each PE contains a CPU, I/O and a local memory hierarchy]
• ARM9 core: 16 KB I-cache, 8 KB D-cache, 2-way set associative, 150 MHz
• C55x DSP core: 16 KB I-cache, 8 KB RAM set, 2-way set associative, 200 MHz
D-Cache
OMAPI Standard (ST/TI)
• Goal: standardize the interfaces between the application processor and peripheral devices in a mobile product
• Provide standard services (APIs) in the OS that can be used by application developers
STMicro Nomadik platform
[Figure: block diagram with main core, memory system, HW accelerators and I/Os]
Nomadik SW platform
Compliant with OMAPI standard
Nexperia™
• Scalable VLIW media processor: 100 to 300+ MHz, 32-bit or 64-bit
• General-purpose scalable RISC processor: 50 to 300+ MHz, 32-bit or 64-bit
• System buses: 32-128 bit
• Library of device IP blocks: image coprocessors, DSPs, UART, 1394, USB … and more
• Fully integrated DSP capabilities
• Single-precision floating point unit (FPU)
• 80 MHz at full industrial temperature range
• 32-bit peripheral control processor with single-cycle instruction (PCP2)
Memories
• 1.5 MByte embedded program flash with ECC
• 32 KByte data flash (EEPROM emulation)
• 56 KB SRAM, 8 KB I$, 16 KB Imem
• 8-channel DMA controller
• Interrupt system with 2 x 255 hardware priority arbitration levels, serviced by the CPU and the PCP2 coprocessor
• Triple bus structure: 64-bit local memory buses to internal flash and data memory, 32-bit system peripheral bus, 32-bit remote peripheral bus
MOSAIC SW Architecture & Components for Automotive Dashboard and Body Control
[Figure: layered software stack]
• Application platform layer (≅ 10% of total SW): application-specific software (speedometer, tachometer, water temp., odometer), application libraries, customer libraries
• SW platform layer (> 60% of total SW): OSEK RTOS, OSEK COM, I/O drivers & handlers (> 20 configurable modules), application programming interface, boot loader, sys. config., transport, KWP 2000, CCP
• HW layer: μControllers library (Nec78k, HC12, HC08, H8S26, MB90)
• SW platform reuse: > 70% of total SW
Parallelism at Three Levels in Extensible Instructions
Three forms of instruction-set parallelism:
• Very Long Instruction Word (VLIW)
• Single Instruction Multiple Data (SIMD), aka "vectors"
• Fused operations, aka "complex operations"
Parallelism: L x M x N. Example: 3 x 4 x 3 = 36 ops/cycle
[Figure: fused operation; N dependent operations, with register and constant inputs, implemented as a single fused operation]
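As a quick check of the L x M x N figure, the sketch below multiplies the three factors. Mapping L to VLIW issue slots and M to SIMD lanes is an assumption made for illustration (the slide only defines N, the number of dependent operations fused into one); the function and parameter names are hypothetical.

```python
def ops_per_cycle(vliw_slots: int, simd_lanes: int, fused_ops: int) -> int:
    """Peak operations per cycle when the three forms of parallelism multiply.

    vliw_slots: independent operations issued per instruction word (assumed to be L)
    simd_lanes: data elements processed by each operation (assumed to be M)
    fused_ops:  dependent operations fused into one operation (N on the slide)
    """
    return vliw_slots * simd_lanes * fused_ops


# The slide's example: 3 x 4 x 3
peak = ops_per_cycle(vliw_slots=3, simd_lanes=4, fused_ops=3)
print(peak)  # 36
```

This is a peak figure: it is reached only when every slot, lane and fused stage does useful work in the same cycle.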
• Various commercial products have been available since 2000: IPFlex DAPDNA-2, NEC Electronics DRP-1, PACT Xpp, Elixent DFabrix
• SONY's VME (Virtual Mobile Engine) is embedded in the Network Walkman and PSP
• Recently, many Japanese vendors have started to develop commercial products: Fujitsu, Hitachi, Lucent, Sanyo, Toshiba (Mep+D-Fabrix)
Processing Element
• Specialized for media/stream processing
• Coarse grain, as opposed to fine grain (the LUTs of FPGAs)
• Components: ALU, shifter + mask unit, multiplexers, registers
• Operations and interconnections between components are changeable
• No instruction fetch mechanism: each element is a part of a large datapath
[Figure: design space of dynamically reconfigurable processors. Labels recovered from the chart: number of nodes, number of gates (10 to 10M), time-multiplexing (1, 3, 8, 16, many), cost; node granularity (32-bit ALU/registers, 8-bit ALU/registers, 4/5-input LUT); architectures plotted: simple RISC, superscalar, VLIW, chip multiprocessor, FPGA, ACM, DAPDNA-2, DRP-1, Kilocore, DRL, CS2112, rDSP, PC101, PARS, ADRES]
Putting it all together
Constant SoC die size
• Slow evolution of peripherals (area decrease)
• GP CPU sub-system complexity 2x at each node (constant area)
• Embedded memory capacity 2x at each node (constant area)
• Loosely coupled DSP sub-system complexity increases by 30% at each node (30% area decrease)
Main trends
• Host CPU evolving toward multi-core architecture to meet the performance increase requirements
• HW acceleration mapped on reconfigurable arrays
−Performance close to dedicated HW in many areas
−Good fit with the regular design constraints imposed by the 45 nm process and beyond
−Excellent structure for optimized power management
−And … FLEXIBILITY …
Reconfigurable HW (DSP fabric)
• Targets signal processing and arithmetic-intensive applications
• Reconfigurable array of simple DSP cores (CNode)
• Low-power architecture: hierarchical clock gating, distributed leakage control (fine-grain power gating)
• Programmable DMA engine
• Reconfigurable at run time, multi-task
Mapping Flow
• ALUs execute a cyclic micro-sequence
• Data exchanges go through a hierarchical clustered interconnect
• The configuration step consists of sequence loading and interconnect programming
[Figure: mapping flow. Behavioral code (a procedure with in, out and inout arguments, constants, and statements such as X=a-in[0]) becomes a DFG through ILP extraction and software pipelining, which is then partitioned and statically scheduled into a coarse-grained configuration. The interconnect is built from MUX clusters at levels 0, 1 and 2, with ports N0_i/N0_o, N1_i/N1_o and N2_i/N2_o]
Mapping Flow
• 3D optimization problem (place/route/schedule)
• Traditional scheduling techniques for VLIW or clustered VLIW don't apply: their solutions don't take into account the spatial dimension of the problem
• Traditional P&R techniques used for FPGAs don't apply either, because they don't consider the time dimension
What can fit in 45 mm² in 45 nm
[Figure: floorplan with host cores 1 and 2 (L1 and L2 caches), a 4 MB multi-port embedded memory, interconnect, peripherals & analog, video H/W, imaging H/W, and a programmable multimedia accelerator with 192 CNodes (40 GOPS); six repeated blocks each contain L1, DSP, HW and DMA]
Case Study: GPUs
Mobile graphics platforms
300-400 million mobile phones with graphics hardware (OpenGL ES) by 2009
The 3D Graphics Pipeline
[Figure: Application and Scene Management run on the host; Geometry, Rasterization and Pixel Processing run on the GPU; the frame buffer in memory feeds the display]
1. The programmer "sends" primitives to be rendered through the pipeline (using API calls)
2. The geometry stage does per-vertex operations
• Move objects (MMUL)
• Move the camera (MMUL)
• Compute lighting at the vertices of each triangle
• Project onto screen (3D to 2D)
• Clipping
• Map to window
3. The rasterizer stage does per-pixel operations
• Given a triangle, identify every pixel that belongs to that triangle
−A pixel belongs to a triangle if and only if the center of the pixel is located in the interior of the triangle
−Evaluate 3 edge equations of the form E=Ax+By+C, where E=0 is exactly on the line, and positive E is towards the interior of the triangle
• Interpolate colors over the triangle (Gouraud interpolation)
• Put images on triangles (texturing)
• Ensure that only what is visible from the camera is displayed (z-buffering)
• Front buffer is displayed, back buffer is rendered to (double buffering)
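The pixel-coverage rule of the rasterizer stage can be sketched as follows. This is a minimal illustration, not how any particular GPU implements it: it assumes a counter-clockwise triangle in a y-up coordinate system, pixel centers at half-integer positions, and a brute-force scan of the whole window. The function names `edge` and `covered_pixels` are hypothetical.

```python
def edge(a, b, p):
    """Evaluate E = A*x + B*y + C for the directed edge a -> b at point p."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    A = ay - by
    B = bx - ax
    C = ax * by - ay * bx
    return A * px + B * py + C


def covered_pixels(triangle, width, height):
    """Return the pixels whose centers lie strictly inside a CCW triangle."""
    v0, v1, v2 = triangle
    pixels = []
    for y in range(height):
        for x in range(width):
            center = (x + 0.5, y + 0.5)
            # Positive E on all three edges means the center is in the interior.
            if (edge(v0, v1, center) > 0
                    and edge(v1, v2, center) > 0
                    and edge(v2, v0, center) > 0):
                pixels.append((x, y))
    return pixels


triangle = [(0, 0), (4, 0), (0, 4)]
covered = covered_pixels(triangle, width=4, height=4)
print(len(covered))  # 6
```

Pixel (3, 0) is excluded because its center falls exactly on the hypotenuse (E = 0). Real rasterizers add a tie-breaking rule for such centers so that triangles sharing an edge never draw the same pixel twice.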
Why is 3D graphics hard on mobile devices?
• Small amount of memory
• Limited instruction set
• Low clock frequency: 100-200 MHz ARM9, 400-600 MHz ARM11
• Small area on the chip for CG
• Must be cheap and physically small
• Powered by batteries!
−A memory access is one of the most expensive operations
−Battery growth: 9% per year
−Performance growth: 40% per year
• Small display, but very close to the eye
−Avg. eye-to-pixel angle 1-4x larger than for desktop
• Limited resources, but high-quality rendering!
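To see how quickly the two growth rates above diverge, the sketch below compounds them year over year. Treating both figures as constant annual compound rates is an assumption made for illustration, and `relative_gap` is a hypothetical helper name.

```python
def relative_gap(years, performance_growth=0.40, battery_growth=0.09):
    """Ratio of compounded performance growth to compounded battery growth."""
    return ((1 + performance_growth) / (1 + battery_growth)) ** years


for years in (1, 3, 5):
    print(years, round(relative_gap(years), 2))
```

Under these assumptions performance outgrows battery capacity by roughly 3.5x after five years, which is why energy per operation, and memory traffic in particular, dominates mobile GPU design.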
PowerVR MBX low-power GPU Architecture
• Three blocks: tile accelerator (TA), image synthesis processor (ISP), texture and shading processor (TSP)
Features
• Tile-based rendering
• ITC™: PowerVR internal true color: color ops on-chip at 32 bpp
• FSAA4Free™: full-screen anti-aliasing for realism at mobile display resolutions
• PVR-TC™: texture compression for small memory footprints
Tiled processing
• Create a triangle list for each tile
−Holds pointers to all triangles overlapping a tile
• Process one tile at a time, and rasterize the triangles in its list
• Work on local (on-chip) tile buffers: color, depth, stencil
• Copy the color tile buffer to the off-chip display buffer
−May need to copy the depth buffer as well
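The per-tile triangle lists can be sketched as a simple binning pass. This is only an illustration under stated assumptions: overlap is approximated conservatively by the triangle's bounding box, the 16-pixel tile size is arbitrary, coordinates are non-negative, and triangle indices stand in for the pointers mentioned above. `bin_triangles` is a hypothetical name, not a PowerVR interface.

```python
from collections import defaultdict


def bin_triangles(triangles, tile_size=16):
    """Map each tile (tx, ty) to the indices of triangles whose bounding box touches it."""
    tile_lists = defaultdict(list)
    for index, triangle in enumerate(triangles):
        xs = [x for x, _ in triangle]
        ys = [y for _, y in triangle]
        first_tx, last_tx = int(min(xs) // tile_size), int(max(xs) // tile_size)
        first_ty, last_ty = int(min(ys) // tile_size), int(max(ys) // tile_size)
        for ty in range(first_ty, last_ty + 1):
            for tx in range(first_tx, last_tx + 1):
                tile_lists[(tx, ty)].append(index)
    return dict(tile_lists)


triangles = [
    [(2, 2), (10, 3), (4, 12)],      # fits inside tile (0, 0)
    [(10, 10), (40, 12), (12, 20)],  # spans three tiles in x and two in y
]
tile_lists = bin_triangles(triangles)
for tile in sorted(tile_lists):
    print(tile, tile_lists[tile])
```

Each tile can then be rasterized on its own using only on-chip color, depth and stencil buffers, touching external memory once when the finished color tile is written out.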
TA, ISP, TSP
• CPU sends triangle data to MBX
• Tile accelerator (TA): sorts triangles, and creates a list of triangle pointers for each tile
−Needs an entire scene before the ISP and TSP blocks can start
−So the TA works on the next image while the ISP and TSP work on the current image (i.e., they work in a pipelined fashion)
• Image synthesis processor (ISP): implements the Z-buffer, color buffer and stencil buffer for a tile
−Depth testing: tests 32 pixels at a time against the Z-buffer
−Records which pixels are visible
−Groups pixels with the same texture and sends them to the TSP
−These are guaranteed to be visible, so each pixel is textured only once (deferred texturing)
• Texture and shading processor (TSP): handles texturing and shading interpolation
−Uses texture compression
−Performs over-sampling
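The deferred-texturing idea (resolve visibility first, texture afterwards) can be sketched in a few lines. This is a simplified model, assuming opaque fragments, a smaller depth value meaning closer to the camera, and one fragment record per covered pixel; the function names and the fragment tuple layout are hypothetical.

```python
def resolve_visibility(fragments):
    """Keep only the nearest fragment per pixel (a software Z-buffer for one tile)."""
    nearest = {}
    for x, y, depth, texture in fragments:
        pixel = (x, y)
        if pixel not in nearest or depth < nearest[pixel][0]:
            nearest[pixel] = (depth, texture)
    return nearest


def group_by_texture(nearest):
    """Group visible pixels by texture, as the ISP does before handing off to the TSP."""
    groups = {}
    for pixel, (_, texture) in nearest.items():
        groups.setdefault(texture, []).append(pixel)
    return groups


fragments = [
    (0, 0, 0.9, "wall"), (0, 0, 0.2, "crate"),  # crate hides wall at pixel (0, 0)
    (1, 0, 0.5, "wall"), (1, 0, 0.7, "crate"),  # wall hides crate at pixel (1, 0)
]
visible = resolve_visibility(fragments)
groups = group_by_texture(visible)
textured_pixels = sum(len(pixels) for pixels in groups.values())
print(groups, textured_pixels)
```

Four fragments arrive but only two pixels are textured, so texture fetches are never spent on surfaces that end up hidden.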
Mobile 3D API
• The mobile 3D industry is embryonic, and moving fast!
• We are where PC graphics were in 1996, but evolving 2-3 times faster!
−Just nine months since OpenGL ES 1.0 was released
−Compliant graphics acceleration is already on the market
• OpenGL ES has become the industry standard for embedded graphics
−We avoided the two years of API indecision that occurred on the PC
Timeline
• PC (four years): 1992, OpenGL 1.0 created; 1994, OpenGL on Windows; 1995, first OpenGL HW on Windows; 1996, OpenGL HW commonplace. API confusion along the way: 3DR, Reality Lab, BRender, RenderWare
• Mobile (< two years): mid-2003, OpenGL ES 1.0 created; 2004, first OpenGL ES hardware; mid-2005, OpenGL ES HW commonplace
The case for a higher abstraction
• A game is much more than just 3D rendering
−Objects, properties, relations (scene graph)
−Key frame and other animations
−Etc. (game logic, sounds, …)
• If everything else but rendering is in Java
−A very large percentage of the processing is in slow Java
−Even if rendering was 100% in HW, total acceleration remains limited
• A higher-level API could help
−More of the functionality could be implemented in native (=faster) code
−Only the game logic must remain in Java
• M3G (JSR-184), a new API
−Nodes and scene graph
−Extensive animation support
−Binary file format and loader
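The claim that total acceleration remains limited is an instance of Amdahl's law. The sketch below works it through with made-up numbers: the 40% and 80% fractions are purely illustrative assumptions, not measurements from the slides, and `overall_speedup` is a hypothetical helper.

```python
def overall_speedup(accelerated_fraction, acceleration):
    """Amdahl's law: overall speedup when only part of the run time is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / acceleration)


# Assumed split: rendering is 40% of frame time, the rest is Java code.
rendering_only = overall_speedup(0.40, float("inf"))
# Assumed split with a higher-level API: 80% of frame time runs in native code or HW, 10x faster.
with_native_api = overall_speedup(0.80, 10.0)
print(round(rendering_only, 2), round(with_native_api, 2))
```

Even with infinitely fast rendering hardware the first case is capped below 1.7x, whereas moving scene-graph and animation work into native code raises the accelerated fraction and therefore the ceiling.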
Freescale iMX31
System-on-chip for mobile multimedia clients
• CPU: ARM11, up to 665 MHz
• VFP: vector floating point co-processor
• Image processing unit (IPU)
• MPEG-4 HW encoder
• HW power management: DVFSC, power & clock gating
• GPU: MBX R-S 3D (PowerVR MBX architecture)
Programmable GPU Model
[Figure: pipeline stages Application, Vertex Program, Rasterization, Fragment Program and Display, with Memory shown alongside]
The OpenGL ES 2.0 Pipeline
What changes from ES 1.1 to ES 2.0?
• General-purpose attributes replace fixed input arrays
• Vertex shader programs replace transform and lighting
• General-purpose uniforms replace fixed lighting & texture state
• General-purpose varyings replace fixed fragment attributes
• Fragment shader programs replace texture / fog / alpha test
[Figure: pipeline from application through vertices and triangles to fragments. Labels recovered: input arrays (vtx coords, normals, colors, tex coords, indices), xform and light stages, primitive assembly, cull/clip, viewport, rasterize, tex0, tex1, fog, Z-test/stencil/scissor, blend; state commands update the state vector; in ES 2.0, attributes 0 … n-1 feed a vertex processor, followed by a fragment processor]
PowerVR SGX
• Advanced shader-based GPU (OpenGL ES 2.0 compliant)
• USSE: scalable, programmable, multi-threaded engine for graphics, video, imaging and other mathematically intensive tasks
−Tasks are automatically broken down into processing packets, which are then scheduled across a number of multi-threaded execution units (MT EUs)
−Coprocessors (texture, pixel and tiling accelerators) assist the MT EUs
• Latency-tolerant architecture: geometry and rasterisation are decoupled using tile-based rendering, enabling on-chip processing, hidden-surface removal and deferred pixel shading
• Scalable: don't want to change the way I design the architecture even if requirements scale up exponentially
• Predictable: I want to know what to expect (latency, bandwidth), and I want to be able to negotiate it
• Robust: keeps going and going… even if something is broken inside
• Efficient: silicon is expensive, power is precious
• Easy: to create, update, analyze, verify
Addressing Interconnect Issues
High-end industrial solutions: evolutionary path from shared busses
• Protocol evolutions: AMBA AHB → AMBA AXI
• Topology evolutions: AMBA AHB → AMBA AHB ML
Challenges
• Complexity (e.g. 4-SHB + 2XBar, 75 actors): how to analyze and verify "spaghetti interconnects"?
• Scalability: the bus is bandwidth-limited, the Xbar is size-limited
• Predictability: how to tie interconnects with floorplanning
The Network-on-Chip Paradigm
[Figure: cores (CPU, DSP, DMA, DRAM, Accel, MPEG) attached through network interfaces (NI) to a NoC made of switches]
The "power of NoCs":
• Clean separation at session layer
−Cores issue end-to-end transactions
−Network deals with transport, network, link, physical
• Modularity at HW level: only 2 building blocks
−Network interface
−Switch (router)
• Physical design aware (floorplan, global routing)
• Scalability is supported from the ground up!
Building blocks: NI
• Session-layer interface with nodes
• Back-end manages interface with switches
Front end: standardized node interface @ session layer. Initiator vs. target distinction is blurred
1. Supported transactions (e.g. QoS read…)
2. Degree of parallelism
3. Session prot. control flow & negotiation
Backend: NoC-specific (layers 1-4)
1. Physical channel interface
2. Link-level protocol
3. Network layer (packetization)
4. Transport layer (routing)
[Figure: the NI front end faces the node, the backend faces the switches]
Building blocks: Switch
• Router: receives and forwards packets
−NOTE: packet-based does not mean datagram!
• Level 3 or Level 4 routing
−No consensus, but generally L4 support is limited (e.g. simple routing)
[Figure: switch internals: data ports with control flow wires, input buffers & control flow, crossbar, allocator/arbiter, QoS & routing, output buffers & control flow]
Æthereal: context
• Consumer electronics
−Reliability & predictability are essential
−Low cost is crucial
−Time to market must be reduced
• Synchronous, using slot tables
−Time-division multiplexed circuits
• Store-and-forward routing
• Headerless packets
−Information is present in slot table
Contention-free routing
• Latency guarantees are easy in circuit switching
• Emulate circuits with packet switching
• Schedule packet injection in the network such that packets never contend for the same link at the same time
−in space: disjoint paths
−in time: time-division multiplexing
−or a combination
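The scheduling condition above can be checked mechanically. The sketch below assumes a simplified timing model in which a flit advances exactly one link per slot and slot tables wrap around after `num_slots` slots; link names, paths and function names are hypothetical and only illustrate the space/time separation.

```python
def link_reservations(path, start_slot, num_slots):
    """List the (link, slot) pairs a connection occupies, one hop per slot."""
    return [(link, (start_slot + hop) % num_slots) for hop, link in enumerate(path)]


def is_contention_free(connections, num_slots):
    """True if no two connections ever use the same link in the same slot."""
    reserved = set()
    for path, start_slot in connections:
        for reservation in link_reservations(path, start_slot, num_slots):
            if reservation in reserved:
                return False
            reserved.add(reservation)
    return True


path_a = ["ni1->r1", "r1->r2", "r2->ni2"]
path_b = ["ni3->r1", "r1->r2", "r2->ni4"]  # shares link r1->r2 with path_a

clash = is_contention_free([(path_a, 0), (path_b, 0)], num_slots=4)
shifted = is_contention_free([(path_a, 0), (path_b, 1)], num_slots=4)
print(clash, shifted)  # False True
```

Both connections cross `r1->r2`, so injecting them in the same slot collides; shifting the second connection by one slot separates them in time without changing either path.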
CFR Example
[Figure: three routers and three network interfaces, each router holding a slot table that lists, per slot, the input (e.g. i1, i3, i4) routed to each output (o1 to o4); input 2 for router 1 is output 1 for router 2]
Use slots to
• avoid contention
• divide up bandwidth
CFR setup
• Use best-effort packets to set up connections
−Set-up & tear-down packets, like in ATM (asynchronous transfer mode)
• Distributed, concurrent, pipelined
• Safe: always consistent
• Compute slot assignment at compile time, at run time, or a combination
• Connection opening is guaranteed to complete (but without a latency guarantee), with commitment or rejection
Router implementation
• Memories (for packet storage)
−Register-based FIFOs are expensive
−RAM-based FIFOs are as expensive
−80% of router is memory
• Special hardware FIFOs are very useful
−20% of router is memory
• Speed of memories
−Registers are fast enough
−RAMs may be too slow
−Hardware FIFOs are fast enough
[Figure: layouts of routers based on register-file and hardware FIFOs, drawn to approximately the same scale (1 mm², 0.26 mm²); labeled blocks: iqu, switch, msu, stu]
Layout
[Figure: router block diagram with BQ and GQ queues per port, crossbar (X), slot table, arbiter, reconfiguration logic and flow control; inputs are programming packets and data packets]
Results
• 5 input and 5 output ports (arity 5)
• 0.25 mm² in CMOS12
• 500 MHz data path, 166 MHz control path
• Flit size of 3 words of 32 bits
• 500 MHz x 32 bits = 16 Gb/s throughput per link, in each direction
• 256 slots & 5x1 flit FIFOs for guaranteed-throughput traffic
• 6x8 flit FIFOs for best-effort traffic
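The link throughput figure follows directly from the data-path clock and the word width, as the short calculation below shows. The reading that the control path runs at one third of the data-path clock because a flit spans three words is an inference from the numbers, not something the slide states; the constant names are hypothetical.

```python
DATA_PATH_HZ = 500e6  # data path clock from the results above
WORD_BITS = 32        # one word crosses the link per data-path cycle
FLIT_WORDS = 3        # flit size from the results above

link_throughput_bps = DATA_PATH_HZ * WORD_BITS
flits_per_second = DATA_PATH_HZ / FLIT_WORDS

print(link_throughput_bps / 1e9, "Gb/s per link and direction")
print(round(flits_per_second / 1e6, 1), "M flits/s")
```

The flit rate comes out at about 166.7 million per second, which lines up with the 166 MHz control path if routing decisions are taken once per flit.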
xpipes: context
• Typical applications targeted by SoCs
−Complex
−Highly heterogeneous (component specialization)
−Communication intensive
• xpipes is a synthesizable, heterogeneous NoC infrastructure
• Three-year lifetime, mature research project
−University of Bologna (architecture)
−Stanford University (design technology)
−University of Cagliari (design and backend)
[Figure: application mapping (custom, domain-specific): tasks Task1 to Task5 mapped onto processors P1(T1), P3(T3), P4(T4), P5(T5), connected through NIs]
Heterogeneous topology
• SoC component specialization leads to the integration of heterogeneous cores
• Ex. MPEG4 decoder
−Non-uniform block sizes
−SDRAM: communication bottleneck
−Many neighboring cores do not communicate
• On a homogeneous fabric:
−Risk of under-utilizing many tiles and links
−Risk of localized congestion
Example: MPEG4 decoder
Core graph representation with annotated average communication requirements
NoC Floorplans
• General purpose: mesh
• Application-specific NoC1 (centralized)
• Application-specific NoC2 (distributed)
Performance, area and power
• Relative link utilization (customNoC/meshNoC): 1.5, 1.55
• Relative area (meshNoC/customNoC): 1.52, 1.85
• Relative power (meshNoC/customNoC): 1.03, 1.22
• Lower latency and better scalability of custom NoCs
Xpipes: features
• Source-based routing
−Very high performance switch design
• Wormhole switching
−Minimize buffering area while reducing latency
• Pipelined links
−Link data introduction interval is not bound by wire delay
−Link-latency (# of repeater stages) insensitive operation
• Parameterizable network building blocks
−Plug-and-play composable for arbitrary network topology
−Design-time tunable buffer size, link width, virtual channels, # of switch I/Os
• Standard OCP interface
Link delay bottleneck
• Wire delay is a serious concern for NoC links
−If the NoC "beat" is determined by the worst-case link delay, performance can be severely limited
• Pipeline links
−Delay is transformed into latency
−Data introduction speed is not bound by link delay any longer!
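The trade between delay and latency can be illustrated with a toy timing model. It assumes the clock period is set by the slower of switch logic and one wire segment, that repeater stages split the wire evenly, and that register overhead is negligible; the delay values and the `link_timing` helper are hypothetical.

```python
import math


def link_timing(wire_delay_ps, logic_delay_ps, stages):
    """Return (clock period in ps, link latency in cycles) for a link split into `stages` segments."""
    per_stage_wire_ps = math.ceil(wire_delay_ps / stages)
    period_ps = max(logic_delay_ps, per_stage_wire_ps)
    return period_ps, stages


unpipelined = link_timing(wire_delay_ps=4000, logic_delay_ps=1000, stages=1)
pipelined = link_timing(wire_delay_ps=4000, logic_delay_ps=1000, stages=4)
print(unpipelined, pipelined)  # (4000, 1) (1000, 4)
```

In this example the end-to-end delay stays at 4 ns either way, but the pipelined link accepts a new flit every 1 ns instead of every 4 ns, so the long wire no longer sets the network clock.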
xpipes Architecture: the Network Interface
• OCP 2.0 protocol to connect to IP cores
• Performs packeting/unpacketing
• Handles routing via path lookup tables
[Figure: NIs on either side of the NoC topology, each with packeting and unpacketing blocks, buffers (BUFF) and an OCP port; clock domains: OCP clk, xpipes clk, OCP clk]
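Routing via a path lookup table can be sketched as source routing: the NI looks up the destination, writes the list of output ports into the packet header, and each switch consumes one entry. Everything in this sketch (the `Packet` layout, the table contents, the port numbers and the function names) is a hypothetical illustration, not the actual xpipes packet format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Packet:
    route: List[int]  # output port to take at each switch along the path
    payload: bytes


# Hypothetical per-NI table: destination name -> output ports, hop by hop.
PATH_LUT: Dict[str, List[int]] = {
    "sdram": [2, 0, 1],
    "dsp": [3],
}


def packetize(destination: str, payload: bytes, lut: Dict[str, List[int]] = PATH_LUT) -> Packet:
    """NI side: wrap a transaction and attach the precomputed route."""
    return Packet(route=list(lut[destination]), payload=payload)


def switch_forward(packet: Packet) -> int:
    """Switch side: consume the next route entry and return the output port."""
    return packet.route.pop(0)


packet = packetize("sdram", b"\x01\x02")
ports_taken = [switch_forward(packet) for _ in range(3)]
print(ports_taken)  # [2, 0, 1]
```

Because the whole path is decided at the source, the switches need no routing tables of their own, which fits the source-based routing described for xpipes.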
Off-chip Memory Interface Unit
"Pull" memory channel
• Control block keeps a programmable table of objects to be moved
−Table entries can be programmed by different cores
• Transfer engine shuffles data between the bus and the memory controller
−Triggers bus or SDRAM transactions
• Memory controller handles SDRAM accesses
[Figure: off-chip memory interface unit with controller and transfer engine between the interconnect and SDRAM; labels recovered: core, RAM, read data, data, ctrl]
Summary
• Why SoCs?
• SoC platforms
• From SoC to MPSoC
• From MPSoC to NoC