Luca Benini ARTIST2 / UNU IIST 2007
Part IV - Design technology for MPSoCs
Luca Benini – DEIS Università di Bologna
System design & Virtual platforms
Analysis of non functional properties: power
System optimization
Allocation and scheduling
Communication synthesis
Methodology Evolution
[Figure: silicon real estate over time, from the 70's to 2005+. Labels: Complex Layouts, Transistors, Sea of Transistors, Gates, RTL, Sea of Gates, Platform Based Design, HW Blocks + SW, Sea of Processors]
Multiprocessor Systems on Chip (MPSoC)
[Figure: platform with CORE engines (DSP, RISC) with instruction caches (I$), IP blocks, MMU, control bus, memory bus and external memory; each core runs Middleware / OS and Application layers]
Modules (sc_module)
Fundamental structural entity
Contain processes
Contain other modules (creating hierarchy)
Ports (sc_in<>, sc_out<>, sc_inout<>)
Modules have ports
Ports have types
A process can be made sensitive to ports/signals
[Figure: SC_MODULE with ports in1, in2, clk, out1 and out2]
SystemC Classes - Processes
Functionality is described in a process
Processes run concurrently
Code inside a process executes sequentially
SystemC has three different types of processes
SC_METHOD
SC_THREAD
SC_CTHREAD
Process types
sc_method: method process
sensitive to a set of signals
executed until it returns
sc_thread: thread process
sensitive to a set of signals
executed until a wait()
sc_cthread: clocked thread process
sensitive only to one edge of clock
executes until a wait() or a wait_until()
watching(reset) restarts from top of process body (reset evaluated on active edge)
Combinational
Sequential
Testbench
Execution of processes
Not hierarchical, communicate through signals
Execution and signal updates
request-update semantics
1. execute all processes that can be executed
2. update the signals written by the processes
3. the updates may cause other processes to be executed
[Figure: module ex with ports a and b, two processes, and an internal signal sig]
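The request-update steps can be sketched in plain C++ without the SystemC library. This is a simplified illustration, not the SystemC kernel: the names Signal, delta_cycle and swap_demo are hypothetical, and every process is executed in each cycle, whereas a real kernel runs only the processes sensitive to the signals that changed.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A signal keeps the value readers currently see and the value writers
// have requested; the request becomes visible only in update().
struct Signal {
    int current = 0;
    int next = 0;
    void write(int v) { next = v; }       // request
    int read() const { return current; }
    bool update() {                       // returns true if the value changed
        bool changed = (next != current);
        current = next;
        return changed;
    }
};

// One cycle: 1. execute all processes, 2. update the written signals.
// Returns true if any signal changed, i.e. other processes may have to run.
inline bool delta_cycle(const std::vector<std::function<void()>>& processes,
                        const std::vector<Signal*>& signals) {
    for (const auto& p : processes) p();  // execute phase
    bool any_changed = false;
    for (Signal* s : signals) {
        assert(s != nullptr);
        any_changed = s->update() || any_changed;  // update phase
    }
    return any_changed;
}

// Two processes exchange the values of a and b within a single cycle.
inline void swap_demo(Signal& a, Signal& b) {
    std::vector<std::function<void()>> procs = {
        [&] { a.write(b.read()); },
        [&] { b.write(a.read()); },
    };
    delta_cycle(procs, {&a, &b});
}
```

Because writes are deferred to the update phase, both processes read the old values and the result does not depend on the order in which they execute.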
Channels
Primitive Hierarchical
Communication semantics
Interface Method Calls (IMC)
Process calls an interface method of a channel
The collection of a fixed set of communication Methods is called an Interface (virtual object without data)
Channels implement one or more Interfaces
Modules can be connected via their Ports to those Channels
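The interface/channel/port relationship can be sketched with ordinary C++ abstract classes. SystemC's own sc_interface and sc_port classes are deliberately not used here, so that the sketch compiles without the library; WriteIf, ReadIf, FifoChannel, Producer and Consumer are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Interface: a fixed set of communication methods, no data.
struct WriteIf {
    virtual ~WriteIf() = default;
    virtual bool write(int value) = 0;
};
struct ReadIf {
    virtual ~ReadIf() = default;
    virtual bool read(int& value) = 0;
};

// Channel: implements one or more interfaces and owns the data.
class FifoChannel : public WriteIf, public ReadIf {
public:
    explicit FifoChannel(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }
    bool write(int value) override {
        if (buf_.size() == capacity_) return false;  // channel full
        buf_.push_back(value);
        return true;
    }
    bool read(int& value) override {
        if (buf_.empty()) return false;              // channel empty
        value = buf_.front();
        buf_.pop_front();
        return true;
    }
private:
    std::size_t capacity_;
    std::deque<int> buf_;
};

// Modules see the channel only through a port typed by an interface.
struct Producer {
    WriteIf* out = nullptr;                          // port
    bool produce(int v) { return out->write(v); }    // interface method call
};
struct Consumer {
    ReadIf* in = nullptr;                            // port
    bool consume(int& v) { return in->read(v); }     // interface method call
};
```

Since the modules depend only on the interfaces, the FIFO could be replaced by a more detailed channel (for example a bus model) without touching Producer or Consumer.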
Time/cycle accurate communication (time constraint)
Approximate timed computation
Protocol channel provides functions for all abstraction bus transaction
Cycle-accurate computation model
Modeled at register-transfer level
PEs are pin-accurate and execute cycle-accurately
Wrappers convert data transfer from a higher level of abstraction to a lower level of abstraction
Successive refinements
Summary: models
Models                             Communication time    Computation time   Communication scheme   PE interface
Specification model                no                    no                 variable               (no PE)
Component-assembly model           no                    approximate        variable channel       abstract
Bus-arbitration model              approximate           approximate        abstract bus channel   abstract
Bus-functional model               time/cycle accurate   approximate        protocol bus channel   abstract
Cycle-accurate computation model   approximate           cycle-accurate     abstract bus channel   pin-accurate
Implementation model               cycle-accurate        cycle-accurate     bus (wire)             pin-accurate
Pure SystemC Flow
SystemC HDL Flow
The missing link: SystemC Synthesis
SystemC was not “born” to be a language for HW implementation (like Verilog & VHDL)
Some people think otherwise (and it would be nice if they were right)
Basic idea: define a synthesizable SystemC subset
Make it another refinement step
But will it succeed? Long story…
[Celoxica 2005]
SystemC contrasted with other design languages
Design vs. Reuse
[Figure: design flow around an industry standard architecture: Specify; Co-Design; SW code and HW design, each with Implement and Verify steps; Co-Verify; code reuse and design reuse of IP; Concept to RTL; RTL to GDSII; Reusable IP Integration; Software, Hardware, Environment, Product]
Reuse of IP components (HW/SW) is key!
Virtual Platforms
Library of functional models of IP Blocks
Standardized channel interface
Multiple levels of abstraction are allowed
Example: ARM PrimeXsys VP
Building a virtual platform
[Figure: platform with CPU core, core support, DSP, HW accelerators, on-chip memory, dedicated peripherals, logic, ARB and Decode blocks]
Complex system interconnect: Configurable Bus Matrix
Core subsystem: Select and Automate Integration
Peripheral IP: Select, Configure and Automate Integration
Build full system: Auto-Validate Build
… we need industry standards for data exchange to enable fast VP construction
SPIRIT: a Standard for IP integration
SPIRIT Meta-data:
Machine-interpretable design IP
Specifies integration requirements
Consistent across all design views
SPIRIT generators:
Point-tool launch
IP configuration launch
Interface for integration with SPIRIT-enabled tools
[Figure: Import, Configure and Integrate steps applied to a platform with CPU, core support, DSP, HW accelerators, on-chip memory, dedicated peripherals, logic, ARB and Decode blocks]
Why use Design Meta-data?
Relate specification to implementation
Machine interpretable coupling of design views
e.g., Meta-data describes how Verilog signal list of a design IP describes a bus interface
Broad applicability
Is applicable to new and legacy IP
No enforced design style or methodology
A by-product of IP import into SPIRIT-enabled tools
SPIRIT in Design Environments
[Figure: SPIRIT-enabled flow. Design Capture and Design Build views of an SoC (µP, system_bus, UART, GPIO, MEM component IP) annotated with meta-data (protocol, bus width, address, interface, registers); SPIRIT IP import/export of SPIRIT-enabled component IP with component XML; SPIRIT-enabled SoC design tool holding the SoC design IP with design XML; SPIRIT APIs to point tools; SPIRIT-enabled generators with generator XML and configurator XML producing configured IP]
Analysis of non functional properties: power
Non-functional properties
[Figure: four cores, each with a private memory (PRI MEM 1 to 4), plus interrupt controller, shared memory and semaphores, connected through an INTERCONNECTION (STbus, AMBA or Xpipes)]
Cycle-accurate VP (~24 Kcycles/sec with 4 cores on a 2-processor Pentium III, 1 GHz, 512 MB)
How to estimate power during SW execution?
Power modeling
Invoked from hardware modules after activation events on a cycle-by-cycle basis
Energy info is passed to data collector routine at each cycle
[Figure: MEMORY (or CACHE) MODULE, Power Model and Data Collector]
1. The module calls the power model function (memory state in, energy spent out)
2. The module sends the energy consumption info to the data collector routines
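The two steps can be sketched as follows. This is only an illustration of the call structure: the class names, the three memory states and the per-cycle energy values are hypothetical and do not come from the actual power models.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

enum class MemState { Idle, Read, Write };

// Hypothetical per-cycle energy values in pJ; real values would come from
// technology characterization.
inline double memory_power_model(MemState s) {
    switch (s) {
        case MemState::Read:  return 10.0;
        case MemState::Write: return 12.0;
        case MemState::Idle:  return 1.0;
    }
    return 0.0;
}

// Accumulates the energy reported by each module.
class DataCollector {
public:
    void add(const std::string& module, double energy_pj) {
        assert(energy_pj >= 0.0);
        totals_[module] += energy_pj;
    }
    double total(const std::string& module) const {
        auto it = totals_.find(module);
        return it == totals_.end() ? 0.0 : it->second;
    }
private:
    std::map<std::string, double> totals_;
};

class MemoryModule {
public:
    MemoryModule(std::string name, DataCollector& dc)
        : name_(std::move(name)), collector_(dc) {}
    // Called once per simulated cycle with the activity of that cycle.
    void cycle(MemState s) {
        double e = memory_power_model(s);  // 1. call the power model
        collector_.add(name_, e);          // 2. report to the data collector
    }
private:
    std::string name_;
    DataCollector& collector_;
};
```

Keeping the power model a separate function means it can be swapped (analytical model, look-up table) without changing the module or the collector.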
Power model for processor cores
Power statistics are obtained by monitoring traces of core execution (e.g. executed instructions)
Need to account for idle power when module is stalled
[Figure: ARM MODULE, Power Model and Data Collector]
1. The simulator calls the data collector routine (passing the core state)
2. The data collector routine gets the energy information (energy spent) from the power model
Core Power Estimation: Instruction-Level
ILPA [TMWL96]
Empirical method for characterizing single (or very short sequences of) instructions.
Key issues:
Evaluation of power dissipation for single instructions.
Choice of representative instructions for characterization.
Advantage: Roughly architecture-independent.
Instruction-Level Power Characterization
Direct measurement of the currents drawn from the power supply while executing the instructions.
HDL simulation:
The instructions are simulated on a processor model in some HDL.
The processor is plugged into a tester machine and simulation traces are applied. The current is measured by the tester.
Use simulation of a gate-level description of the processor.
Instruction-Level Models
A power cost is assigned to each instruction.
Two components of the cost:
Static component, called “base-cost”: It is the individual instruction cost without a notion of “state”.
Dynamic component, called “circuit state effects”: It accounts for the previous processor state.
Dynamic cost accounts for events depending on sequences of events (e.g., cache misses, pipeline stalls).
Extracting the model
The base cost is computed as follows:
An infinite loop containing a total of N copies of the target instruction I is executed.
The average current is measured as described earlier.
The power cost is obtained from the values of the current, the supply voltage and the number of cycles per instruction.
N must be large enough to amortize the loop overhead.
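One way to write this computation down, as a sketch: the symbols are assumptions introduced here, with I_avg the measured average current, V_dd the supply voltage, n_I the number of cycles per instruction and T_c the clock period.

```latex
P_I = I_{avg} \cdot V_{dd}, \qquad E_I = P_I \cdot n_I \cdot T_c
```

P_I is the average power drawn while the loop of instruction I runs, and E_I is the resulting energy base cost of one instance of I.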
Computing program execution cost
Due to the averaging process, the costs for I1 → I2 and I2 → I1 cannot be distinguished.
The cost of a program can be summarized as follows:
$$\mathrm{Cost}(\mathrm{Program}) = \sum_i (B_i \cdot N_i) + \sum_{i,j} (O_{i,j} \cdot N_{i,j}) + \sum_k E_k$$
where:
B_i: Base cost of instruction i.
N_i: # of occurrences of instruction i.
O_{i,j}: Dynamic cost of sequence i → j.
N_{i,j}: # of occurrences of sequence i → j.
E_k: Other effects, obtained from program profiling.
Instruction-Level power model: Example
Example of power cost values (expressed in pJ):
Instruction   Base   Circuit State Effects
Name          Cost   LOAD   DLOAD   ADD    MULT
LOAD          1.98   0.13   0.15    1.19   0.92
DLOAD         2.37          0.17    1.19   0.92
ADD           0.99                  0.26   0.53
MULT          1.19                         0.66
Example of computation:
Evaluation Program (initial state is ADD)   Base Cost   Circuit State
DLOAD A←x, B←y                              2.37        1.19
LOAD C←z                                    1.98        0.15
ADD A←C, B                                  0.99        1.19
Total                                       5.34        2.53
Total value = 7.87 pJ/(3·25 ns) = 104.9 μW (Tc = 25 ns)
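The evaluation of a program against the cost table can be sketched as below. The base costs and circuit-state effects are the table values; mirroring the triangular table into a full matrix, the function name program_cost and leaving out the E_k term are choices made for this sketch.

```cpp
#include <array>
#include <cassert>
#include <vector>

enum Instr { LOAD = 0, DLOAD = 1, ADD = 2, MULT = 3 };

// Base costs in pJ, indexed by Instr.
constexpr std::array<double, 4> kBase = {1.98, 2.37, 0.99, 1.19};

// Circuit-state effects in pJ. The table is triangular because the costs of
// i -> j and j -> i cannot be distinguished, so it is mirrored here.
constexpr double kState[4][4] = {
    {0.13, 0.15, 1.19, 0.92},
    {0.15, 0.17, 1.19, 0.92},
    {1.19, 1.19, 0.26, 0.53},
    {0.92, 0.92, 0.53, 0.66},
};

struct Cost {
    double base;   // sum of B_i * N_i
    double state;  // sum of O_ij * N_ij
};

// Walks an instruction trace and accumulates both cost components; the E_k
// term (cache misses, stalls, ...) is left out of this sketch.
inline Cost program_cost(Instr initial, const std::vector<Instr>& trace) {
    Cost c{0.0, 0.0};
    Instr prev = initial;
    for (Instr cur : trace) {
        assert(cur >= LOAD && cur <= MULT);
        c.base += kBase[cur];
        c.state += kState[prev][cur];
        prev = cur;
    }
    return c;
}
```

For the evaluation program the call would be program_cost(ADD, {DLOAD, LOAD, ADD}); dividing the sum of both components by the execution time (number of instructions times Tc) gives the average power.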
Micro-architectural Power Model
The processor is viewed as an interconnection of macro blocks
E.g. Execution units, register file, etc.
Power models are built for the macros
E.g. Analytical, look-up tables, etc.
Advantage: allows micro-architecture exploration
Disadvantage: no black-box for COTS processors
FLPA: Functional Level Power Analysis
Between ILPA and micro-architectural
Fewer parameters than ILPA, less info on internals than micro-architectural
Suitable for complex cores, with limited internal information
Algorithmic parameters require functional simulation (ISS run or code analysis)
The handler can be easily modified to be invoked by a pseudo-hardware module for collection of system power statistics
Power profiling: SW view
[Figure: power distributions for send and for receive, for message sizes of 128 bytes and 256 bytes]
System optimization
Allocation and scheduling
Design as optimization
Design space: the set of “all” possible design choices
Constraints: solutions that we are not willing to accept
Cost function: a property we are interested in (execution time, power, reliability…)
Hardware synthesis
[Figure: from APPLICATION (algorithm, frequency response plot of amplitude in dB vs. frequency, filter data-flow graphs) through HIGH-LEVEL SYNTHESIS to ARCHITECTURE (ASIC, GP signal processor, memory, MCM, interconnect), followed by LOGIC AND PHYSICAL SYNTHESIS]
Behavioral synthesis
[Figure: Control/Data Flow Graph (CDFG) mapped to an implementation with registers, a multiplier and an adder]
Allocation, Assignment, and Scheduling
[Figure: data-flow graph with delay (D), add, subtract and shift operations]
Allocation: How Much? 2 adders, 1 shifter, 24 registers
Assignment: Where? Shifter 1
Schedule: When? Time Slot 4
Techniques Well Understood and Mature
Application Mapping
The problem of allocating and scheduling task graphs on multi-processors in a distributed real-time system is NP-hard.
New tool flows for efficient mapping of multi-task applications onto hardware platforms
[Figure: task graph T1 to T8 mapped onto processors Proc. 1 … Proc. N, each with a private memory, connected by an INTERCONNECT. Allocation assigns tasks to processors; the schedule orders the tasks on each resource over time so that the deadline is met]
When & Why Offline Optimization?
Plenty of design-time knowledge
Applications pre-characterized at design time
Dynamic transitions between different pre-characterized scenarios
Aggressive exploitation of system resources
Reduces overdesign (lowers cost)
Strong performance guarantees
Applicable for many embedded applications
Scheduling & Voltage Scaling
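[Figure: power vs. time for tasks τ1, τ2, τ3 running at frequencies f1, f2, f3 before the deadline; CPU with supply voltage Vdd and body-bias voltage Vbs]
Energy/speed trade-offs: varying the voltages
Different voltages: different frequencies
Mapping and scheduling: given (fastest freq.)
[Figure: at the fastest frequency, τ1, τ2, τ3 finish before the deadline and leave slack]
Voltage and Frequency scaling make the problem even harder!
Current off-line approaches solve mapping, scheduling and voltage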
ARM cores (including instruction and data caches)
Tightly coupled software-controlled scratch-pad memories (SPM)
AMBA AHB
DMA engine
RTEMS OS
Power models for 0.13 μm technology (STM)
Variable Voltage/Frequency cores with discrete (Vdd,f) pairs
Frequency dividers scale down the baseline 200 MHz system clock
Cores use non-cacheable shared memory to communicate
Semaphore and interrupt facilities are used for synchronization
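How a discrete (Vdd,f) pair might be chosen for a single task can be sketched as follows. The slides state only that the pairs are discrete; the energy model (dynamic energy proportional to cycles times Vdd squared), the function names and any concrete voltage values are assumptions of this sketch.

```cpp
#include <cassert>
#include <vector>

struct OperatingPoint {
    double vdd;       // supply voltage in V
    double freq_mhz;  // clock frequency in MHz
};

// Dynamic energy of a task up to a constant factor: cycles * Vdd^2.
inline double relative_energy(double cycles, const OperatingPoint& op) {
    return cycles * op.vdd * op.vdd;
}

// Execution time in microseconds: cycles / f[MHz].
inline double exec_time_us(double cycles, const OperatingPoint& op) {
    assert(op.freq_mhz > 0.0);
    return cycles / op.freq_mhz;
}

// Returns the index of the lowest-energy operating point that still meets
// the deadline, or -1 if no point is fast enough.
inline int pick_point(double cycles, double deadline_us,
                      const std::vector<OperatingPoint>& points) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(points.size()); ++i) {
        if (exec_time_us(cycles, points[i]) > deadline_us) continue;
        if (best < 0 || relative_energy(cycles, points[i]) <
                            relative_energy(cycles, points[best]))
            best = i;
    }
    return best;
}
```

With hypothetical pairs (1.2 V, 200 MHz), (1.0 V, 100 MHz) and (0.8 V, 50 MHz), a 10000-cycle task with a 120 μs deadline would run at 100 MHz: the 50 MHz setting is too slow and the 200 MHz setting wastes energy on slack.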
[Figure: tiles with synchronization support and private memories, a shared memory and an interrupt slave (INT Slave) on an AMBA AHB INTERCONNECT; a SystemC CLOCK TREE GENERATOR with a programmable register (Prog. REG) produces CLOCK 1 … CLOCK N and Int_CLK]
Application model
Task graph
A group of tasks T
Task dependencies
Execution times expressed in clock cycles: WCN(Ti)
Communication time (writes & reads) expressed as: WCN(WTiTj) and WCN(RTiTj)
These values can be back-annotated from functional simulation or computed using WCET analysis tools (e.g. AbsInt)
Node type
Normal; Fork, And; Branch, Or
[Figure: task graph Task1 to Task6 with edges T1→T2, T1→T3, T2→T4, T3→T5, T4→T6, T5→T6; each node is annotated with WCN(Ti) and each edge with WCN(WTiTj) and WCN(RTiTj)]
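A possible in-memory form of this application model, together with a simple use of the annotations, is sketched below. The struct layout and the function critical_path are hypothetical; the sketch pessimistically charges the write and read cost on every edge, as if all communicating tasks ran on different processors, and it ignores node types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Edge {
    int to;               // index of the successor task
    std::uint64_t wcn_w;  // WCN(WTiTj): cycles to write the data
    std::uint64_t wcn_r;  // WCN(RTiTj): cycles to read the data
};

struct Task {
    std::uint64_t wcn;      // WCN(Ti): execution cycles
    std::vector<Edge> out;  // outgoing dependencies
};

// Longest path through the graph in cycles. Assumes tasks are stored in
// topological order (every edge goes from a lower to a higher index).
inline std::uint64_t critical_path(const std::vector<Task>& g) {
    std::vector<std::uint64_t> start(g.size(), 0);  // earliest start per task
    std::uint64_t longest = 0;
    for (std::size_t i = 0; i < g.size(); ++i) {
        const std::uint64_t finish = start[i] + g[i].wcn;
        longest = std::max(longest, finish);
        for (const Edge& e : g[i].out) {
            assert(e.to > static_cast<int>(i) &&
                   e.to < static_cast<int>(g.size()));
            start[e.to] = std::max(start[e.to], finish + e.wcn_w + e.wcn_r);
        }
    }
    return longest;
}
```

The result is a lower bound on the makespan of any schedule that places all communicating tasks on different processors, which makes it a quick feasibility check against the deadline before any optimization is run.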
Task memory requirements
[Figure: two ARM cores (#1 and #2), each with SPM, interrupt controller and semaphores, and two private memories on the system bus]
Each task has three kinds of memory requirements:
Program Data
Internal State
Communication queues
Task storage can be allocated by Optimizer:
On the local SPM
On the remote Private Memory
Communicating tasks might run:
On the same processor → negligible communication cost
On different processors → costly message exchange procedure
General modeling framework: highly unstructured optimization problems
No black-box/generic optimizer can solve them efficiently
We developed a flexible algorithmic framework which is tuned on specific problems
Logic Based Benders Decomposition
Decomposes the problem into 2 sub-problems:
Allocation & Assignment (& freq. setting) → IP
Objective Function: e.g. minimizing energy consumption during execution and communication of tasks
Scheduling → CP
Objective Function: e.g. minimizing energy consumption during frequency switching
[Figure: the allocation & frequency assignment problem (INTEGER PROGRAMMING; objective function: communication cost & energy consumption; memory constraints) passes a valid allocation to the scheduling problem (CONSTRAINT PROGRAMMING; timing constraint), which returns a no-good as a linear constraint]
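The interaction between the two sub-problems can be sketched on a toy instance. This is not the IP or CP model of the actual tool: the master problem is solved by brute-force enumeration over two processors, the scheduling sub-problem only checks that the tasks placed on one processor fit back to back before the deadline, and all names are hypothetical. What the sketch preserves is the loop structure, in which the scheduler returns a no-good that the allocator must respect from then on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct CommEdge { int a, b; int cost; };  // cost paid if a and b are split

struct Problem {
    std::vector<int> duration;            // execution time of each task
    std::vector<CommEdge> edges;
    int deadline;
};

// A no-good is a set of tasks (bit mask) that must not share one processor.
using NoGood = std::uint32_t;

// Bit t of assign is the processor (0 or 1) of task t.
inline bool violates(std::uint32_t assign, std::uint32_t all, NoGood ng) {
    return (assign & ng) == ng || ((~assign & all) & ng) == ng;
}

// Master problem (stand-in for the IP): cheapest allocation satisfying all
// no-goods collected so far. Returns false if no allocation is left.
inline bool solve_master(const Problem& p, const std::vector<NoGood>& cuts,
                         std::uint32_t& best) {
    const std::uint32_t n = static_cast<std::uint32_t>(p.duration.size());
    const std::uint32_t all = (1u << n) - 1u;
    int best_cost = -1;
    for (std::uint32_t a = 0; a <= all; ++a) {
        bool ok = true;
        for (NoGood ng : cuts)
            if (violates(a, all, ng)) { ok = false; break; }
        if (!ok) continue;
        int cost = 0;
        for (const CommEdge& e : p.edges)
            if (((a >> e.a) & 1u) != ((a >> e.b) & 1u)) cost += e.cost;
        if (best_cost < 0 || cost < best_cost) { best_cost = cost; best = a; }
    }
    return best_cost >= 0;
}

// Sub-problem (stand-in for the CP scheduler): returns 0 if every processor
// meets the deadline, otherwise the task set of an overloaded processor.
inline NoGood check_schedule(const Problem& p, std::uint32_t assign) {
    for (int proc = 0; proc < 2; ++proc) {
        int load = 0;
        NoGood set = 0;
        for (std::size_t t = 0; t < p.duration.size(); ++t)
            if (static_cast<int>((assign >> t) & 1u) == proc) {
                load += p.duration[t];
                set |= (1u << t);
            }
        if (load > p.deadline) return set;
    }
    return 0;
}

// Iterates master and sub-problem until the allocation is schedulable.
inline bool benders(const Problem& p, std::uint32_t& assign, int& iterations) {
    assert(!p.duration.empty() && p.duration.size() < 16);
    std::vector<NoGood> cuts;
    for (iterations = 1; solve_master(p, cuts, assign); ++iterations) {
        const NoGood ng = check_schedule(p, assign);
        if (ng == 0) return true;  // valid allocation and schedule found
        cuts.push_back(ng);        // feed the no-good back to the master
    }
    return false;                  // master problem became infeasible
}
```

Each no-good removes at least the allocation that produced it, so the loop terminates; because the master always returns the cheapest remaining allocation, the first schedulable one is optimal for the master objective.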
Computational scalability
Simplified CP and IP formulations
Hybrid approach clearly outperforms pure CP and IP techniques
Search time bounded to 1000 sec.
CP and IP could find a solution in at most 50% of the instances
Hybrid approach always found a solution
[Figure: deterministic task graphs, mapping & scheduling]
Computational Scalability
Hundreds of decision variables
Much beyond ILP solver or CP solver capability
[Figure: deterministic task graphs, mapping & scheduling & v,f selection]
[Figure: stochastic task graphs, mapping & scheduling & min bus usage]
Optimality gap
Comparison with heuristic 2-phase solution (GA)
“timing barrier”
gap significant when constraints are tight
Challenge: the Abstraction Gap
The abstraction gap between high level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
Programmers must be aware of the simplifying assumptions made in optimization tools.
[Figure: Optimization Development flow with Platform Modelling, Optimization, Analysis, Optimal Solution, Starting Implementation, Platform Execution and Final Implementation; the abstraction gap separates the optimization side from the implementation side]
MAX error lower than 10%
AVG error equal to 4.51%, with standard deviation of 1.94
All deadlines are met
Throughput required: 1 frame/10 ms
With 2 processors and 4 possible frequency & voltage settings:
Task Graph: 10 computational tasks; 15 communication tasks
Without optimizations: 50.9 μJ
With optimizations: 17.1 μJ (-66.4%)
Challenge: programming environment
A software development toolkit to help programmers in software implementation:
a generic customizable application template (OFFLINE SUPPORT)
a set of high-level APIs (ONLINE SUPPORT in RT-OS (RTEMS))
The main goals are:
predictable application execution after the optimization step
guarantees on high performance and constraint satisfaction
Starting from a high level task and data flow graph, software developers can easily and quickly build their application infrastructure.
Programmers can intuitively translate high level representation into C-code using our facilities and library
Example
Number of nodes (e.g. 12)
Graph of activities
Node type: Normal, Branch, Conditional, Terminator
Node behaviour: Or, And, Fork, Branch
Number of CPUs: 2
Task Allocation
Task Scheduling
Arc priorities
[Figure: example graph of 12 nodes (N1, B2, B3, C4 to C7, N8 to N11, T12) with fork, or, and, branch behaviours and arcs a1 to a14, allocated to processors P1 and P2 and scheduled over time before the deadline]
#define TASK_NUMBER 12
Relationship with RT techniques
We can handle periodic task graphs
Multiple rates can be analyzed by unrolling and periodic extension
Cannot deal with aperiodic/sporadic tasks unknown at design time
They would require unbounded unrolling
Currently assumes non-preemptive scheduling
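The amount of unrolling needed for multi-rate graphs can be estimated as sketched below, assuming the common approach of unrolling every graph over the hyperperiod (the least common multiple of the periods). The slides do not spell out the procedure, so this approach and the function names are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

inline std::uint64_t gcd_u64(std::uint64_t a, std::uint64_t b) {
    while (b != 0) {
        const std::uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Hyperperiod: least common multiple of all task-graph periods.
inline std::uint64_t hyperperiod(const std::vector<std::uint64_t>& periods) {
    std::uint64_t h = 1;
    for (std::uint64_t p : periods) {
        assert(p > 0);
        h = h / gcd_u64(h, p) * p;
    }
    return h;
}

// Number of copies of each graph contained in the unrolled graph.
inline std::vector<std::uint64_t> unroll_counts(
        const std::vector<std::uint64_t>& periods) {
    const std::uint64_t h = hyperperiod(periods);
    std::vector<std::uint64_t> counts;
    for (std::uint64_t p : periods) counts.push_back(h / p);
    return counts;
}
```

A sporadic task has no fixed period to feed into this computation, which is one way to see why it would require unbounded unrolling.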
Looking forward
Toward a mature SDK
Mature programmer support (Eclipse toolkit, OpenMAX support)
Extend semantics (multi-rate SDF)
Ports on real platforms (Cell BE underway, Nomadik is under discussion)
Optimization engine enhancements
Dealing with multiple use cases
Variable execution times
Aggressive communication scheduling on NoCs
Address preemption and sporadic tasks
System optimization
Allocation and scheduling
Communication synthesis
Data Communication vs. Processing
[Figure: SoCs circa 2004: a µP subsystem with Mem Bus, Main Bus and I/O Bus; critical decision was µP choice. SoCs circa 2007: µP, Core 1, Core 2 … Core N and a DRAM controller (DRAMC); critical decision is interconnect choice]
Communication Architecture Design and Verification becoming Highest Priority in Contemporary SoC Design!
Exploding core counts requiring more advanced Interconnects
EDA cannot solve this architectural problem easily
Complexity too high to hand craft (and verify!)
Source: SONICS Inc.
Communication Architectures in today’s complex systems significantly affect performance, power, cost and time- to- market!
communication architecture consumes up to 50% of total on-chip power!
communication is THE most critical aspect affecting system performance
communication architecture design, customization, exploration, verification and implementation takes up the largest chunk of a design cycle
ever increasing number of wires, repeaters, bus components (arbiters, bridges, decoders etc.) increases system cost
Need for Communication-centric Design Flow
Typical Industrial SoC Design Flow
[Figure: design flow from application requirements through the functional model, architecture model, communication model and implementation model to logic synthesis and physical implementation. Steps: algorithm selection and optimization; HW/SW partitioning, behavior mapping and architecture exploration; CA selection/exploration, protocol generation and topology synthesis; interface synthesis and cycle scheduling]
no algorithm optimization
ad-hoc partitioning and mapping; “on-paper” exploration
manual communication architecture selection
manual bus<->IP interface synthesis
very limited RTL exploration (~weeks), verification (~months)
physical implementation (~months)
Physically Aware Bus Topology and Parameter Synthesis
Pasricha et al. [DAC 2005] presented the FABSYN approach, which
automatically synthesizes bus topology and parameters (arbitration schemes, bus widths, bus speeds, DMA burst size)
automatically detects and eliminates bus cycle timing violations during synthesis
Increasingly important in DSM era as clock speeds increase and lengthy propagation delays cause large number of timing violations
Saves costly design iterations during physical implementation
[Figure: Automated Bus Architecture Synthesis with a Floorplan and Wire Delay Estimation Engine turns a system of CPU1, M2, M3, S1 to S4 and MEM1 to MEM3 into a bridged bus architecture (main1, main2, main3, periph) in which MEM2 is split into MEM2a and MEM2b. Pasricha et al. [DAC 2005]]
FABSYN Synthesis Flow
Inputs: CTG, comm arch. constraint set (Ψ), IP library
Output: synthesized communication arch
[Figure: flowchart with the steps preprocess, simple bus mapping, Select unsatisfied TCP from Ω, explore_params, mutate_topology, Run floorplanner and delay estimator, optimize_design, and the decisions “TCP met?”, “Ω empty?” and “Ω still empty?”]
CTG or Communication Throughput Graph incorporates SoC IPs (nodes) and their interconnections (edges)
TCP or Throughput Constraint Path is a CTG sub-graph representing a constraint to be satisfied; Ω is a superset of all TCPs
Ψ or Communication Parameter Constraint Set is a discrete set of valid values for BA parameters, to ensure realistic output
Challenges
Complexity (e.g. 4-SHB + 2XBar, 75 actors): how to analyze and verify “spaghetti interconnects”?
Scalability: bus is bandwidth-limited, Xbar is size-limited
Predictability: how to tie interconnects with floorplanning
[Figure: interconnect built from several AHB buses]
The Network-on-Chip Paradigm
[Figure: NoC connecting DSP, DRAM, DMA, CPU, Accel and MPEG cores through network interfaces (NI) and switches]
The “power of NoCs”:
Clean separation at session layer
Cores issue end-to-end transactions
Network deals with transport, network, link, physical
Modularity at HW level: only 2 building blocks
Network interface
Switch (router)
Physical design aware (floorplan, global routing)
Scalability is supported from the ground up!
NoC Synthesis Project
Started in 2002
UNIBO, UNICA, Stanford, EPFL
Objective: develop a complete EDA flow for NoC synthesis from application to P&R
[Figure: flow from Application, Codesign and Simulation (input traffic model, constraint graph, comm graph) with user objectives (power, hop delay) and constraints (area, power, hop delay, wire length) into SunFloor Topology Synthesis (includes floorplanner, NoC router), using NoC area models and NoC power models from area and power characterization and IP core models; system specs feed Platform Generation (xpipes-Compiler) with the NoC component library, producing SystemC code for architectural simulation and RTL for FPGA emulation; backend flow with floorplanning specifications, synthesis, placement & routing, to fab]
The xpipes NoC
Packeting/unpacketing
OCP 2.0 protocol to connect to IP cores
Source routing
Dual Clock 2 Stage Pipeline
Objectives
Design fully application-specific custom topologies
Generate deadlock-free networks: both routing and message-level deadlocks are removed
Optimize architectural parameters of the NoC (frequency, flit size), tuning based upon application requirements
Physical design awareness
Leverage accurate analytical models for area and power, back-annotated from layouts
Integrated floorplanner to achieve design closure while also considering wiring complexity
SUNFLOOR Steps
Vary NoC architectural parameters: frequency, data-width
Bandwidth, power consumption varies
Vary number of switches
Which is better? – Do not know!!
SUNFLOOR Steps
Vary NoC architectural parameters: frequency, data-width
Vary number of switches
Synthesize best topology
NP-Hard problem (single path multi-commodity flow)
Use fast & efficient heuristics
SUNFLOOR Steps
Vary NoC architectural parameters: frequency, data-width
Vary number of switches
Synthesize best topology
Perform floorplan of design
Calculate timing characteristics
If design constraints met, save solution
Choose most efficient solution satisfying all design constraints
Synthesis Algorithm
[Figure: core graph with cores vld, rld, iqn, vpr, pad, smm, isn, ups, idct, arm, a-d and vpm; edges annotated with communication bandwidths]
Obtain min-cut partitions of core-graph
Cores in a partition share a switch
Find lower bound on switch sizes, switch power
Establish switch connectivity by routing flows
Account for constraints on # hops, deadlock avoidance, switch size
Minimize power
Refer to Murali et al. ICCAD06 for full details
Case Study 1: Comparison Against Hand-Mapped Topology
[Figure: processor-memory cluster; benchmark execution times compared against the constraint]
Benchmark execution times comply with application requirements and are even 10% better on the SunFloor topology.
Case Study 2: SUNFLOOR Vs Regular Topologies
Application        Topology   Power (mW)   Avg. nr. hops
MWD (12 cores)     Custom     20.53        1.15
                   Mesh       90.17        2.00
                   Opt-mesh   38.60        2.00
VOPD (12 cores)    Custom     30.00        1.33
                   Mesh       95.94        2.00
                   Opt-mesh   46.48        2.00
MPEG4 (12 cores)   Custom     27.24        1.50
                   Mesh       96.82        2.17
                   Opt-mesh   60.97        2.17
VPROC (42 cores)   Custom     79.64        1.67
                   Mesh       301.8        2.58
                   Opt-mesh   136.1        2.58
On average, SunFloor custom topologies:
2.75x less power consumption
1.55x less hop delay
Despite large design space, maximum run time of 1 hour for VPROC
Looking Forward
Quality of service guarantees for critical traffic
Run-time configurability
Robustness w.r.t. static/dynamic variations, errors
Network interfaces: interoperability, performance
Summary
MPSoC design technology is in fast evolution
Support for functional design is reaching industrial maturity
Virtual platforms
IP reuse standardization
Support for analysis of non-functional properties is immature
Even functional analysis is only simulation-based
System optimization is at research stage