

Luca Benini ARTIST2 / UNU IIST 2007

Part IV - Design technology for MPSoCs

System design & Virtual platforms
Analysis of non functional properties: power
System optimization
  • Allocation and scheduling
  • Communication synthesis

Luca Benini – DEIS Università di Bologna


Methodology Evolution

[Figure: silicon real estate vs. time. 70's: transistors (complex layouts); 80's: gates (sea of transistors); early 90's: RTL (sea of gates); late 90's: platform based design (HW blocks + SW); 2005+: Multiprocessor Systems on Chip, MPSoC (sea of processors). The platform and MPSoC diagrams show DSP and RISC core engines with I$/D$, IP blocks, MMU, external memory, a memory bus and control buses, under middleware/OS and application layers.]


Platform based SoC Design Flow

[Figure: design flow driven by the application requirements.
  1. Algorithm selection, optimization → functional model
  2. HW/SW partitioning, behavior mapping, architecture exploration → architecture model
  3. CA selection/exploration, protocol generation, topology synthesis → communication model
  4. Interface synthesis, cycle scheduling → implementation model
  5. Logic synthesis and physical implementation
Each model is drawn as CPU, IP, memory (M) and slave (S) blocks between INPUT and OUTPUT.]


SystemC

Objectives:
• Model HW with a SW programming language
• Achieve fast simulation
• Provide support for HW/SW system design

Requirement:
• Give HW semantics to SW models

Supported by a large consortium of semiconductor and EDA companies


SystemC Design Vision

SystemC as a single design language

[Figure: a System Specification (SystemC) is refined into HW (SystemC) and SW (SystemC), and refined again down to the binary level, alongside a Testbench.]


SystemC Model Structure



SystemC Classes - Modules and Ports

Modules (sc_module)
• Fundamental structural entity
• Contain processes
• Contain other modules (creating hierarchy)

Ports (sc_in<>, sc_out<>, sc_inout<>)
• Modules have ports
• Ports have types
• A process can be made sensitive to ports/signals

[Figure: an SC_MODULE with ports in1, in2, clk, out1 and out2.]

SystemC Classes - Processes

• Functionality is described in a process
• Processes run concurrently
• Code inside a process executes sequentially
• SystemC has three different types of processes:
  • SC_METHOD
  • SC_THREAD
  • SC_CTHREAD

[Figure: an SC_MODULE with ports in1, in2, clk, out1, out2 containing two PROCESS blocks.]


Process types

SC_METHOD: method process
• sensitive to a set of signals
• executed until it returns

SC_THREAD: thread process
• sensitive to a set of signals
• executed until a wait()

SC_CTHREAD: clocked thread process
• sensitive only to one edge of clock
• executes until a wait() or a wait_until()
• watching(reset) restarts from top of process body (reset evaluated on active edge)

[Side labels in the figure: Combinational, Sequential, Testbench]


Execution of processes

• Processes are not hierarchical; they communicate through signals
• Execution and signal updates follow request-update semantics:
  1. execute all processes that can be executed
  2. update the signals written by the processes
  The updates may trigger other processes to be executed

[Figure: module ex with ports a and b, two processes and an internal signal sig.]
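The two-step request-update scheme can be sketched in plain C++, without the SystemC library. This is only an illustration of the idea, not the SystemC kernel: `Signal`, `delta_cycle` and `swap_demo` are hypothetical names. A write only requests a new value, so every process that reads a signal during the same evaluation step still sees the old one.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical signal with request-update semantics: write() only records
// the requested value, update() makes it visible to readers.
struct Signal {
    int current = 0;
    int next = 0;
    bool pending = false;

    int read() const { return current; }
    void write(int v) { next = v; pending = true; }

    // Returns true when the visible value actually changed.
    bool update() {
        if (!pending) return false;
        pending = false;
        bool changed = (next != current);
        current = next;
        return changed;
    }
};

using Process = std::function<void()>;

// One step: (1) execute all runnable processes, (2) update the signals they
// wrote. Returns true if any signal changed, i.e. other processes would
// have to be executed next.
inline bool delta_cycle(const std::vector<Process>& runnable,
                        const std::vector<Signal*>& signals) {
    for (const Process& p : runnable) p();   // evaluation phase
    bool any_change = false;
    for (Signal* s : signals) {              // update phase
        assert(s != nullptr);
        if (s->update()) any_change = true;
    }
    return any_change;
}

// Two concurrent processes swap the values of a and b. This works only
// because both reads happen before either update.
inline bool swap_demo(Signal& a, Signal& b) {
    std::vector<Process> procs = {
        [&a, &b] { a.write(b.read()); },
        [&a, &b] { b.write(a.read()); },
    };
    return delta_cycle(procs, {&a, &b});
}
```

With ordinary variables the second process would read the value just written by the first, and the swap would fail; deferring the update makes the result independent of process execution order.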


Channels

[Figure: primitive and hierarchical channels.]


Communication semantics

Interface Method Calls (IMC)
• A process calls an interface method of a channel
• The collection of a fixed set of communication methods is called an Interface (virtual object without data)
• Channels implement one or more Interfaces
• Modules can be connected via their Ports to those Channels
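The interface/channel/port split can be mimicked in plain C++. The sketch below does not use the SystemC classes; `WriteIf`, `ReadIf`, `Fifo`, `Port`, `Producer` and `Consumer` are hypothetical names chosen for illustration. The point is that modules call interface methods through a port and never see the concrete channel.

```cpp
#include <cassert>
#include <deque>

// Interfaces: a fixed set of communication methods, no data.
struct WriteIf {
    virtual void write(int v) = 0;
    virtual ~WriteIf() = default;
};
struct ReadIf {
    virtual bool read(int& v) = 0;   // false when nothing is available
    virtual ~ReadIf() = default;
};

// Channel: implements one or more interfaces and owns the data.
class Fifo : public WriteIf, public ReadIf {
public:
    void write(int v) override { q_.push_back(v); }
    bool read(int& v) override {
        if (q_.empty()) return false;
        v = q_.front();
        q_.pop_front();
        return true;
    }
private:
    std::deque<int> q_;
};

// Port: the module's typed access point, bound to a channel before use.
template <typename If>
class Port {
public:
    void bind(If& channel) { ch_ = &channel; }
    If* operator->() const { assert(ch_ && "port not bound"); return ch_; }
private:
    If* ch_ = nullptr;
};

// Modules only know the interface type of their ports.
struct Producer {
    Port<WriteIf> out;
    void run(int n) { for (int i = 0; i < n; ++i) out->write(i * i); }
};
struct Consumer {
    Port<ReadIf> in;
    int sum = 0;
    void run() { int v; while (in->read(v)) sum += v; }
};

// Connect both modules to the same channel and run them once.
inline int imc_demo() {
    Fifo fifo;
    Producer p;
    Consumer c;
    p.out.bind(fifo);
    c.in.bind(fifo);
    p.run(4);   // writes 0, 1, 4, 9
    c.run();
    return c.sum;
}
```

Replacing `Fifo` with a more detailed channel that implements the same interfaces would leave `Producer` and `Consumer` untouched.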


Model types

1. Specification model
2. PE*-assembly model
3. Bus-arbitration model
4. Time-accurate communication model
5. Cycle-accurate computation model
6. Implementation model

* Processing elements

[In the figure, the intermediate models are labeled TLM.]


PE-assembly & Bus-arbitration Models

PE-assembly model:
• Processing elements (PEs)
• Message-passing channels

Bus-arbitration model:
• Abstract bus channels
• Bus arbiter arbitrates bus conflicts


Time-accurate Communication model

• Time/cycle accurate communication (time constraint)
• Approximate timed computation
• Protocol channel provides functions for all abstraction levels of a bus transaction


Cycle-accurate computation model

• Modeled at register-transfer level
• PEs are pin accurate and execute cycle-accurately
• Wrappers convert data transfers from a higher level of abstraction to a lower level of abstraction


Successive refinements


Summary: models

Models                             Communication time    Computation time   Communication scheme   PE interface
Specification model                no                    no                 variable               (no PE)
Component-assembly model           no                    approximate        variable channel       abstract
Bus-arbitration model              approximate           approximate        abstract bus channel   abstract
Bus-functional model               time/cycle accurate   approximate        protocol bus channel   abstract
Cycle-accurate computation model   approximate           cycle-accurate     abstract bus channel   pin-accurate
Implementation model               cycle-accurate        cycle-accurate     bus (wire)             pin-accurate


Pure SystemC Flow


SystemC HDL Flow



The missing link: SystemC Synthesis

• SystemC was not “born” as a language for HW implementation (unlike Verilog & VHDL)
• Some do not think so (and it would be nice if they were right)
  • Basic idea: define a synthesizable SystemC subset
  • Make it another refinement step
• But will it succeed? Long story…

[Celoxica 2005]


SystemC contrasted with other design languages


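Design vs. Reuse

[Figure: design side: from the product environment, specify, then architecture and co-design split into SW code and HW design, each with implement and verify steps and joint co-verification, yielding software and hardware. Reuse side: design reuse and code reuse of IP within an industry standard architecture. Concept to RTL and RTL to GDSII phases draw on reusable IP and IP integration.]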

Reuse of IP components (HW/SW) is key!


Virtual Platforms

• Library of functional models of IP blocks
• Standardized channel interface
• Multiple levels of abstraction are allowed


Example: ARM Prime Xsys VP


Building a virtual platform

• Core subsystem: select and automate integration
• Complex system interconnect: configurable bus matrix
• Peripheral IP: select, configure and automate integration
• Build full system: auto-validate build

… we need industry standards for data exchange to enable fast VP construction

[Figure: CPU core with core support, HW accelerators, DSP, on-chip memory, dedicated peripherals, logic and an ARB/decode block.]


SPIRIT: a Standard for IP integration

SPIRIT Meta-data:
• Machine-interpretable design IP
• Specifies integration requirements
• Consistent across all design views

SPIRIT generators:
• Point-tool launch
• IP configuration launch
• Interface for integration with SPIRIT-enabled tools

[Figure: import, configure and integrate steps applied to the platform blocks (CPU with core support, HW accelerators, DSP, on-chip memory, dedicated peripherals, logic, ARB/decode).]


Why use Design Meta-data?

Relate specification to implementation
• Machine interpretable coupling of design views
• E.g., meta-data describes how the Verilog signal list of a design IP maps onto a bus interface

Broad applicability
• Is applicable to new and legacy IP
• No enforced design style or methodology
• A by-product of IP import into SPIRIT-enabled tools


SPIRIT in Design Environments

[Figure: SPIRIT in a design environment. Design capture describes component IP (µP, UART, GPIO, MEM) and the system_bus through meta-data such as protocol, bus width, address, interface and registers; design build assembles them. SPIRIT-enabled IP (component IP with component XML) is imported into and exported from a SPIRIT-enabled SoC design tool (SoC design IP with design XML). Through the SPIRIT APIs the tool drives SPIRIT-enabled generators (generator XML, configurator XML) and point tools, producing configured IP.]


Analysis of non functional properties: power



Non-functional properties

[Figure: four cores, each with a private memory (PRI MEM 1 to 4), plus shared memory, semaphores and an interrupt controller, connected by an interconnection (STbus, AMBA or Xpipes).]

Cycle accurate VP (~24 Kcycles/sec with 4 cores on a 2-proc Pentium III, 1 GHz, 512 MB)

How to estimate power during SW execution?


Power modeling

• Power models are invoked from hardware modules after activation events, on a cycle-by-cycle basis
• Energy info is passed to the data collector routine at each cycle

For a MEMORY (or CACHE) module:
1. The module calls the power model function
2. The module sends the energy consumption info to the data collector routines

[Figure: MEMORY (or CACHE) MODULE, Power Model and Data Collector, with arrows labeled memory state and energy spent.]


Power model for processor cores

• Power statistics are obtained by monitoring traces of core execution (e.g. executed instructions)
• Need to account for idle power when the module is stalled

For the ARM module:
1. The simulator calls the data collector routine
2. The data collector routine gets the energy information from the power model

[Figure: ARM MODULE, Power Model and Data Collector, with arrows labeled core state and energy spent.]
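The cycle-by-cycle bookkeeping described on the two slides above can be sketched as follows. This is a minimal illustration, not the simulator's code: `MemoryPowerModel`, `DataCollector` and the energy values are hypothetical. The only physics used is that energy in pJ divided by time in ns gives power in mW.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical power model for a memory: energy per cycle, in pJ,
// depending on what the memory is doing in that cycle.
struct MemoryPowerModel {
    double read_pj;
    double write_pj;
    double idle_pj;

    enum class State { Idle, Read, Write };

    double energy(State s) const {
        switch (s) {
            case State::Read:  return read_pj;
            case State::Write: return write_pj;
            default:           return idle_pj;   // stalled or idle cycle
        }
    }
};

// Data collector: accumulates the energy reported by each module.
class DataCollector {
public:
    void add(const std::string& module, double energy_pj) {
        assert(energy_pj >= 0.0);
        total_pj_[module] += energy_pj;
    }
    double energy_pj(const std::string& module) const {
        auto it = total_pj_.find(module);
        return it == total_pj_.end() ? 0.0 : it->second;
    }
    // Average power in mW over `cycles` clock cycles of period `tclk_ns`.
    // pJ / ns = mW, so no extra scaling factor is needed.
    double power_mw(const std::string& module, long cycles, double tclk_ns) const {
        assert(cycles > 0 && tclk_ns > 0.0);
        return energy_pj(module) / (cycles * tclk_ns);
    }
private:
    std::map<std::string, double> total_pj_;
};

// Four cycles of activity on one RAM with illustrative energy values.
inline double collector_demo() {
    MemoryPowerModel ram{10.0, 12.0, 1.0};
    DataCollector dc;
    using S = MemoryPowerModel::State;
    const S trace[] = {S::Read, S::Idle, S::Write, S::Idle};
    for (S s : trace) dc.add("RAM 0", ram.energy(s));   // one call per cycle
    return dc.power_mw("RAM 0", 4, 5.0);                // 24 pJ over 20 ns
}
```

Charging a non-zero energy for idle cycles is what accounts for the idle power of a stalled module.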


Core Power Estimation: Instruction-Level

ILPA [TMWL96]
• Empirical method for characterizing single (or very short sequences of) instructions.
• Key issues:
  • Evaluation of power dissipation for single instructions.
  • Choice of representative instructions for characterization.
• Advantage: roughly architecture-independent.


Instruction-Level Power Characterization

• Direct measurement of the currents drawn from the power supply while executing the instructions.
• HDL simulation:
  • The instructions are simulated on a processor model in some HDL.
  • The processor is plugged into a tester machine and simulation traces are applied. The current is measured by the tester.
• Use simulation of a gate-level description of the processor.


Instruction-Level Models

• A power cost is assigned to each instruction.
• Two components of the cost:
  • Static component, called “base cost”: the individual instruction cost without a notion of “state”.
  • Dynamic component, called “circuit state effects”: it accounts for the previous processor state.
• Dynamic cost accounts for events depending on sequences of events (e.g., cache misses, pipeline stalls).


Extracting the model

The base cost is computed as follows:
• An infinite loop containing a total of N copies of the target instruction I is executed.
• The average current is measured as described earlier.
• The power cost is obtained from the values of the current, the supply voltage and the cycles per instruction.

N should be large enough to amortize the loop overhead.
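The slide does not spell out the formula. A common reading, assumed here, is that the energy of one instruction equals average current × supply voltage × cycles per instruction × clock period. The function name and the numbers in the usage note are hypothetical.

```cpp
#include <cassert>

// Energy of one instruction in pJ, from the average current measured while
// looping on that instruction.
//   i_avg_ma : average supply current, in mA
//   vdd_v    : supply voltage, in V
//   cycles   : clock cycles per instruction
//   tclk_ns  : clock period, in ns
// mA * V = mW and mW * ns = pJ, so the product is already in pJ.
inline double base_cost_pj(double i_avg_ma, double vdd_v, int cycles, double tclk_ns) {
    assert(i_avg_ma >= 0.0 && vdd_v > 0.0 && cycles > 0 && tclk_ns > 0.0);
    return i_avg_ma * vdd_v * cycles * tclk_ns;
}
```

For example, an illustrative 2 mA at 1.5 V for a single-cycle instruction with a 25 ns clock gives `base_cost_pj(2.0, 1.5, 1, 25.0)`, i.e. 75 pJ.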


Computing program execution cost

Due to the averaging process, the costs for I1 → I2 and I2 → I1 cannot be distinguished.
The cost of a program can be summarized as follows:

$$\mathrm{Cost}(\mathrm{Program}) = \sum_i B_i N_i + \sum_{i,j} O_{ij} N_{ij} + \sum_k E_k$$

where:
• $B_i$: base cost of instruction i.
• $N_i$: number of occurrences of instruction i.
• $O_{ij}$: dynamic cost of sequence i → j.
• $N_{ij}$: number of occurrences of sequence i → j.
• $E_k$: other effects, obtained from program profiling.
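The program cost formula can be evaluated by walking an instruction trace once, which is equivalent to counting the occurrences first and multiplying afterwards. The sketch below is an illustration under that assumption; `InstructionModel`, `program_cost_pj` and all numeric values in the usage note are hypothetical and not taken from any characterized processor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Instruction-level energy model, all values in pJ.
struct InstructionModel {
    std::vector<double> base;                    // base cost per instruction
    std::vector<std::vector<double>> overhead;   // circuit-state cost per pair
};

// Sum of base costs, plus circuit-state costs of consecutive pairs, plus
// other effects (e.g. cache misses) supplied by program profiling.
//   trace         : executed instructions, as indices into the model
//   initial_state : instruction assumed to precede the trace
inline double program_cost_pj(const InstructionModel& m,
                              const std::vector<int>& trace,
                              int initial_state,
                              double other_effects_pj = 0.0) {
    const std::size_t n = m.base.size();
    assert(m.overhead.size() == n);
    assert(initial_state >= 0 && static_cast<std::size_t>(initial_state) < n);
    double cost = other_effects_pj;
    int prev = initial_state;
    for (int instr : trace) {
        assert(instr >= 0 && static_cast<std::size_t>(instr) < n);
        cost += m.base[instr];             // base cost of this instruction
        cost += m.overhead[prev][instr];   // circuit-state effect prev -> instr
        prev = instr;
    }
    return cost;
}
```

With a two-instruction model `{{1.0, 2.0}, {{0.1, 0.5}, {0.5, 0.2}}}`, the trace `{0, 1, 1}` starting from state 0 costs (1.0 + 0.1) + (2.0 + 0.5) + (2.0 + 0.2) = 5.8 pJ. The overhead matrix is symmetric because the measurement cannot tell i → j from j → i.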


Instruction-Level power model: Example

Example of power cost values (expressed in pJ):

Instruction   Base    Circuit state effects
name          cost    LOAD    DLOAD   ADD     MULT
LOAD          1.98    0.13    0.15    1.19    0.92
DLOAD         2.37            0.17    1.19    0.92
ADD           0.99                    0.26    0.53
MULT          1.19                            0.66

Example of computation:

Evaluation program (initial state is ADD)   Base cost   Circuit state
DLOAD A←x, B←y                              2.37        1.19
LOAD  C←z                                   1.98        0.15
ADD   A←C, B                                0.99        1.19
Total                                       5.34        2.53

Total value = (5.34 + 2.53) pJ / (3 · 25 ns) = 7.87 pJ / 75 ns ≈ 104.9 μW (Tc = 25 ns)


Micro-architectural Power Model

• The processor is viewed as an interconnection of macro blocks
  • E.g. execution units, register file, etc.
• Power models are built for the macros
  • E.g. analytical, look-up tables, etc.
• Advantage: allows micro-architecture exploration
• Disadvantage: no black-box model for COTS processors


FLPA: Functional Level Power Analysis [Laurent03]

• Between ILPA and micro-architectural
  • Fewer parameters than ILPA, less info on internals than micro-architectural
• Suitable for complex cores with limited internal information (example: TI62, TI67 DSPs)
• Algorithmic parameters require functional simulation (ISS run or code analysis)

Algorithmic parameters
• α: parallelism rate
• β: processing rate
• γ: ext. IM access rate
• ε: DMA activity rate
• τ: ext. DM access rate

Architectural parameters
• F: clock frequency
• MM: internal Mem mode (mapped, bypass, cache, freeze)
• DD: data mapping
• DW: DMA data width


Power profiling: HW view

• Waveforms: cycle by cycle consumption
• Output file: totals

    Power estimation
    ----------------
    Energy spent:
      ARM 0
        core:   25609147.30 [pJ]
        cache: 105048808.17 [pJ]
      ARM 1
        core:   25609092.30 [pJ]
        cache: 105048808.17 [pJ]
      RAM 0:  2825183.87 [pJ]
      RAM 1:  2825183.87 [pJ]
      RAM 2:  2825183.87 [pJ]
      RAM 3:  2824958.26 [pJ]
      RAM 4:        0.00 [pJ]
      BUS:   50778876.39 [pJ]

    Power spent:
      ARM 0
        core:   51.18 [mW]
        cache: 209.95 [mW]
      ARM 1
        core:   51.18 [mW]
        cache: 209.95 [mW]
      RAM 0:   5.65 [mW]
      RAM 1:   5.65 [mW]
      RAM 2:   5.65 [mW]
      RAM 3:   5.65 [mW]
      RAM 4:   0.00 [mW]
      BUS:   101.49 [mW]


Using power models in SW

Initialization (installs the handler for SWI_METRIC_START in the ISS core):

    ...
    RegisterSWI(SWI_METRIC_START, metric_start_swi_call);
    ...

Handler:

    uint32_t metric_start_swi_call(CArmProc *arm, uint32_t r0, uint32_t r1,
                                   uint32_t r2, uint32_t r3)
    {
        /* start collecting statistics for the calling core */
        statobject->startMeasuring(arm->ID);
        return r0;
    }

Program (handler invocation):

    ...
    __asm ("swi " SWI_METRIC_STARTstr);
    ...

The handler can easily be modified to be invoked by a pseudo-hardware module for collection of system power statistics


Power profiling: SW view

[Figure: power distributions for send and for receive, for message sizes of 128 bytes and 256 bytes.]


System optimization: Allocation and scheduling

Design as optimization
• Design space: the set of “all” possible design choices
• Constraints: solutions that we are not willing to accept
• Cost function: a property we are interested in (execution time, power, reliability…)

Hardware synthesis

[Figure: an ALGORITHM (application: filter data-flow graphs with adders, delays D and coefficients c1 to c8, k, between IN and OUT; frequency response plot, Ampl (dB) vs. Freq) goes through HIGH-LEVEL SYNTHESIS (states S1 to S4) to an ARCHITECTURE (ASIC, GP, signal processor, memory, interconnect, MCM), followed by LOGIC AND PHYSICAL SYNTHESIS.]

Behavioral synthesis

[Figure: a Control/Data Flow Graph (CDFG) and its implementation with registers, a multiplier and an adder.]

Allocation, Assignment, and Scheduling

• Allocation: How much? 2 adders, 1 shifter, 24 registers
• Assignment: Where? (e.g. Shifter 1)
• Schedule: When? (e.g. Time Slot 4)

Techniques Well Understood and Mature

[Figure: data-flow graph with +, -, >> operations and a delay D.]


Application Mapping

• Allocating and scheduling task graphs on multiprocessors in a distributed real-time system is NP-hard.
• New tool flows for efficient mapping of multi-task applications onto hardware platforms

[Figure: task graph T1 to T8 mapped onto Proc. 1 to Proc. N, each with a private memory, over an INTERCONNECT. The allocation assigns tasks to processors; the schedule orders them on a resources vs. time chart before the deadline.]


When & Why Offline Optimization?

• Plenty of design-time knowledge
  • Applications pre-characterized at design time
  • Dynamic transitions between different pre-characterized scenarios
• Aggressive exploitation of system resources
  • Reduces overdesign (lowers cost)
  • Strong performance guarantees
• Applicable for many embedded applications


Scheduling & Voltage Scaling

• Mapping and scheduling: given (fastest freq.)
• Energy/speed trade-offs: varying the voltages (Vdd, Vbs) of the CPU
• Different voltages: different frequencies (f1, f2, f3)
• Voltage and frequency scaling make the problem even harder!
• Current off-line approaches solve mapping, scheduling and voltage selection separately (sequentially)

[Figure: power vs. time for tasks τ1, τ2, τ3. At the fastest frequency the tasks leave slack before the deadline; with frequencies f1, f2, f3 they run at lower power and use the time up to the deadline.]


Target architecture

• Homogeneous computation tiles:
  • ARM cores (including instruction and data caches);
  • Tightly coupled software-controlled scratch-pad memories (SPM);
• AMBA AHB;
• DMA engine;
• RTEMS OS;
• Power models for 0.13 μm technology (STM)
• Variable voltage/frequency cores with discrete (Vdd, f) pairs
• Frequency dividers scale down the baseline 200 MHz system clock
• Cores use non-cacheable shared memory to communicate
• Semaphore and interrupt facilities are used for synchronization

[Figure: tiles with synchronization blocks and private memories, a shared memory and an interrupt slave on an AMBA AHB interconnect; a programmable register drives the clock tree generator, which derives CLOCK 1 to CLOCK N; the platform is modeled in SystemC.]


Application model

Task graph
• A group of tasks T
• Task dependencies
• Execution times expressed in clock cycles: WCN(Ti)
• Communication time (writes & reads) expressed as: WCN(WTiTj) and WCN(RTiTj)
• These values can be back-annotated from functional simulation or computed using WCET analysis tools (e.g. AbsInt)

Node type
• Normal; Fork, And; Branch, Or

[Figure: task graph Task1 to Task6; nodes annotated with WCN(Ti), edges with WCN(WTiTj) and WCN(RTiTj).]
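To make the application model concrete, the sketch below stores WCN values on tasks and edges and computes the makespan of a given allocation and execution order. It is a simplified illustration, not the optimizer from these slides: the names are hypothetical, scheduling is non-preemptive, and write/read cycles are assumed to be charged to producer/consumer only when the two tasks sit on different cores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Dependency from -> to, with write and read cost in cycles.
struct Edge { int from, to; long write_cycles, read_cycles; };

struct TaskGraph {
    std::vector<long> wcn;     // worst-case cycles of each task
    std::vector<Edge> edges;   // task dependencies
};

// Makespan in cycles.
//   core_of[i] : core running task i (the allocation)
//   order      : global execution order, assumed to respect all dependencies
inline long makespan(const TaskGraph& g, const std::vector<int>& core_of,
                     const std::vector<int>& order, int n_cores) {
    const std::size_t n = g.wcn.size();
    assert(core_of.size() == n && order.size() == n && n_cores > 0);

    // Communication is paid only between tasks on different cores.
    std::vector<long> duration(g.wcn);
    for (const Edge& e : g.edges) {
        if (core_of[e.from] != core_of[e.to]) {
            duration[e.from] += e.write_cycles;
            duration[e.to]   += e.read_cycles;
        }
    }

    std::vector<long> finish(n, 0L);
    std::vector<long> core_free(static_cast<std::size_t>(n_cores), 0L);
    long end = 0;
    for (int t : order) {
        long start = core_free[core_of[t]];       // core must be idle
        for (const Edge& e : g.edges)             // predecessors must be done
            if (e.to == t) start = std::max(start, finish[e.from]);
        finish[t] = start + duration[t];
        core_free[core_of[t]] = finish[t];
        end = std::max(end, finish[t]);
    }
    return end;
}
```

For a fork T0 → {T1, T2} with cycle counts 100, 200, 300 and 10/20 write/read cycles per edge, running everything on one core takes 600 cycles, while moving T2 to a second core takes 430 cycles: parallelism wins, but the message exchange is no longer free.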


Task memory requirements

Each task has three kinds of memory requirements:
• Program data
• Internal state
• Communication queues

Task storage can be allocated by the optimizer:
• On the local SPM
• On the remote private memory

Communicating tasks might run:
• On the same processor → negligible communication cost
• On different processors → costly message exchange procedure

[Figure: two tiles (#1, #2) on the system bus, each with ARM core, interrupt controller, SPM, semaphores and private memory.]




Application Development Flow

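[Figure: the CTG goes through a characterization phase on the simulator, which produces application profiles; the optimization phase (optimizer) produces allocation and scheduling; application development support turns them into the optimal SW application implementation, which is then executed on the platform.]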


Optimization framework

• Deterministic & stochastic task graphs
• Constraints
  • Resources: computation, communication, storage
  • Timing: task deadlines, makespan
• Objective functions
  • Performance (e.g. makespan)
  • Power (energy)
  • Bus utilization

General modeling framework: highly unstructured optimization problems
• No black-box/generic optimizer can solve them efficiently
• We developed a flexible algorithmic framework which is tuned on specific problems


Logic Based Benders Decomposition

Decomposes the problem into 2 sub-problems:
• Allocation & Assignment (& freq. setting) → IP
  • Objective function, e.g.: minimizing energy consumption during execution and communication of tasks
• Scheduling → CP
  • Objective function, e.g.: minimizing energy consumption during frequency switching

[Figure: the INTEGER PROGRAMMING stage (allocation & freq. assignment; memory constraints; objective function: communication cost & energy consumption) passes a valid allocation to the CONSTRAINT PROGRAMMING stage (scheduling; timing constraint), which returns a no-good as a linear constraint.]
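The interaction between the two sub-problems can be illustrated with a toy loop. This is a sketch under strong simplifying assumptions, not the authors' solver: exhaustive enumeration stands in for the IP solver, a per-core load check stands in for the CP scheduler (tasks are treated as independent), and all names and numbers are hypothetical. What it does preserve is the structure: the master proposes the cheapest allocation, the sub-problem either accepts it or returns a no-good that is added to the master as a constraint.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Problem {
    std::vector<long> duration;              // cycles per task
    std::vector<std::vector<double>> cost;   // cost[task][core], e.g. energy
    int n_cores;
    long deadline;                           // time budget of each core
};

// No-good: the listed tasks must not all be placed on `core`. As a linear
// constraint: sum over tasks of x[task][core] <= |tasks| - 1.
struct NoGood { int core; std::vector<int> tasks; };

using Allocation = std::vector<int>;         // core index per task

inline bool violates(const Allocation& a, const NoGood& ng) {
    for (int t : ng.tasks)
        if (a[t] != ng.core) return false;
    return true;
}

// Master (stand-in for IP): cheapest allocation satisfying all no-goods.
inline bool master(const Problem& p, const std::vector<NoGood>& cuts, Allocation& best) {
    const std::size_t n = p.duration.size();
    assert(n > 0 && p.n_cores > 0);
    Allocation a(n, 0);
    double best_cost = std::numeric_limits<double>::infinity();
    bool found = false;
    while (true) {
        bool ok = true;
        for (const NoGood& ng : cuts)
            if (violates(a, ng)) { ok = false; break; }
        if (ok) {
            double c = 0.0;
            for (std::size_t t = 0; t < n; ++t) c += p.cost[t][a[t]];
            if (c < best_cost) { best_cost = c; best = a; found = true; }
        }
        std::size_t i = 0;                   // next allocation, odometer style
        while (i < n && ++a[i] == p.n_cores) a[i++] = 0;
        if (i == n) return found;
    }
}

// Sub-problem (stand-in for CP): a core is schedulable if its load fits the
// deadline. On failure, report the tasks that overload the core.
inline bool find_cut(const Problem& p, const Allocation& a, NoGood& cut) {
    for (int c = 0; c < p.n_cores; ++c) {
        long load = 0;
        std::vector<int> on_core;
        for (std::size_t t = 0; t < a.size(); ++t) {
            if (a[t] == c) {
                load += p.duration[t];
                on_core.push_back(static_cast<int>(t));
            }
        }
        if (load > p.deadline) { cut.core = c; cut.tasks = on_core; return true; }
    }
    return false;
}

// Iterate master and sub-problem until the allocation is schedulable.
inline bool benders(const Problem& p, Allocation& result) {
    std::vector<NoGood> cuts;
    Allocation a;
    while (master(p, cuts, a)) {
        NoGood cut;
        if (!find_cut(p, a, cut)) { result = a; return true; }
        cuts.push_back(cut);                 // exclude this infeasible pattern
    }
    return false;                            // no valid allocation exists
}
```

Because each no-good removes only allocations that cannot be scheduled, the first allocation accepted by the sub-problem is also the cheapest feasible one for the master's cost. In the test instance used with this sketch (three tasks, two cores, deadline 90), the loop needs three no-goods before it accepts the allocation {1, 0, 0}.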


Computational scalability

Deterministic task graphs, mapping & scheduling
• Simplified CP and IP formulations
• Hybrid approach clearly outperforms pure CP and IP techniques
• Search time bounded to 1000 sec.
  • CP and IP can find a solution only in 50% or less of the instances
  • Hybrid approach always found a solution

[Figure: results plot; only the axis ticks (16, 25, 36, 49, 64, 81, 100 and 1 to 7) survive.]


Computational Scalability

• Hundreds of decision variables
• Much beyond ILP solver or CP solver capability

[Figure panels: deterministic task graphs, mapping & scheduling & v,f selection; stochastic task graphs, mapping & scheduling & min bus usage.]


Optimality gap

• Comparison with heuristic 2-phase solution (GA)
• Gap significant when constraints are tight (“timing barrier”)


Challenge: the Abstraction Gap

• The abstraction gap between high level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
• Programmers must be aware of the simplifying assumptions made in optimization tools.

[Figure: optimization development flow: platform modelling, optimization analysis, optimal solution, starting implementation, platform execution, final implementation, with the abstraction gap between the optimal solution and the implementation.]


Validation of optimizer solutions: Throughput

• Optimal allocation & schedule from the optimizer, validated on the virtual platform (250 instances)
• MAX error lower than 10%
• AVG error equal to 4.51%, with standard deviation of 1.94
• All deadlines are met

[Figure: histogram of probability (%) vs. throughput difference (%), from -5% to 11%.]


Validation of optimizer solutions: Power

• Optimal allocation & schedule from the optimizer, validated on the virtual platform (250 instances)
• MAX error lower than 10%
• AVG error equal to 4.80%, with standard deviation of 1.71

[Figure: histogram of probability (%) vs. energy consumption difference (%), from -5% to 11%.]


GSM Encoder

• Throughput required: 1 frame/10 ms.
• Task graph: 10 computational tasks; 15 communication tasks.
• With 2 processors and 4 possible frequency & voltage settings:
  • Without optimizations: 50.9 μJ
  • With optimizations: 17.1 μJ (-66.4%)


Challenge: programming environment

A software development toolkit to help programmers in software implementation:
• a generic customizable application template (OFFLINE SUPPORT);
• a set of high-level APIs (ONLINE SUPPORT) in the RT-OS (RTEMS)

The main goals are:
• predictable application execution after the optimization step;
• guarantees on high performance and constraint satisfaction.

Starting from a high level task and data flow graph, software developers can easily and quickly build their application infrastructure.
Programmers can intuitively translate the high level representation into C code using our facilities and library


Example
• Number of nodes (e.g. 12)
• Graph of activities
• Node type: Normal, Branch, Conditional, Terminator
• Node behaviour: Or, And, Fork, Branch
• Number of CPUs: 2
• Task allocation
• Task scheduling
• Arc priorities

    #define TASK_NUMBER 12
    #define N_CPU 2

    //Node Type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
    uint node_type[TASK_NUMBER] = {1,2,2,1,..};

    //Node Behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
    uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

    //Task allocation and per-core schedule
    uint task_on_core[TASK_NUMBER] = {1,1,2,1};
    int schedule_on_core[N_CPU][TASK_NUMBER] = {1,2,4,8..};

    //Queue consumers
    uint queue_consumer [..] [..] = {0,1,1,0,..,0,0,0,1,1,.,0,0,0,0,0,1,1..,0,0,0,0,....};

[Figure: 12-node task graph T1 to T12 (nodes N1, B2, B3, C4 to C7, N8 to N11, T12; arcs a1 to a14; fork, or, and, branch nodes) and its schedule on processors P1 and P2 over time, within the deadline.]


Relationship with RT techniques

• We can handle periodic task graphs
  • Multiple rates can be analyzed by unrolling and periodic extension
• Cannot deal with aperiodic/sporadic tasks unknown at design time
  • They would require unbounded unrolling
• Currently assumes non-preemptive scheduling


Looking forward

Toward a mature SDK
• Mature programmer support (Eclipse toolkit, OpenMAX support)
• Extend semantics (multi-rate SDF)
• Ports on real platforms (Cell BE underway, Nomadik is under discussion)

Optimization engine enhancements
• Dealing with multiple use cases
• Variable execution times
• Aggressive communication scheduling on NoCs
• Address preemption and sporadic tasks


System optimization
  • Allocation and scheduling
  • Communication synthesis


Data Communication vs. Processing

• SoCs circa 2004: critical decision was µP choice
• SoCs circa 2007: critical decision is interconnect choice
• Exploding core counts require more advanced interconnects
• EDA cannot solve this architectural problem easily
• Complexity too high to hand craft (and verify!)

Communication architecture design and verification are becoming the highest priority in contemporary SoC design!

Source: SONICS Inc.

[Figure: 2004 SoC with a µP subsystem on a memory bus, a main bus and an I/O bus; 2007 SoC with cores 1 to N, µPs and a DRAM controller (DRAMC).]


Communication architectures in today's complex systems significantly affect performance, power, cost and time-to-market!
• Communication architecture consumes up to 50% of total on-chip power!
• Communication is THE most critical aspect affecting system performance
• Communication architecture design, customization, exploration, verification and implementation take up the largest chunk of a design cycle
• The ever increasing number of wires, repeaters and bus components (arbiters, bridges, decoders etc.) increases system cost

Need for Communication-centric Design Flow


Typical Industrial SoC Design Flow

[Figure: flow from application requirements through algorithm selection and optimization (functional model), architecture exploration (architecture model), communication model, interface synthesis and cycle scheduling (implementation model) to logic synthesis and physical implementation.]

Current practice, as annotated on the flow:
• no algorithm optimization
• ad-hoc partitioning and mapping; “on-paper” exploration
• manual communication architecture selection
• manual bus<->IP interface synthesis
• very limited RTL exploration (~weeks)
• verification (~months)
• physical implementation (~months)


Physically Aware Bus Topology and Parameter Synthesis

Pasricha et al. [DAC 2005] presented the FABSYN approach, which
• automatically synthesizes bus topology and parameters (arbitration schemes, bus widths, bus speeds, DMA burst size)
• automatically detects and eliminates bus cycle timing violations during synthesis
  • Increasingly important in the DSM era, as clock speeds increase and lengthy propagation delays cause a large number of timing violations
  • Saves costly design iterations during physical implementation

[Figure: automated bus architecture synthesis, coupled with a floorplan and wire delay estimation engine, turns a single bus (CPU1, M2, M3, MEM1 to MEM3, S1 to S4) into several buses (main1, main2, main3, periph) linked by bridges, with MEM2 split into MEM2a and MEM2b. Pasricha et al. [DAC 2005]]


FABSYN Synthesis Flow

[Figure: flowchart. Inputs: CTG, constraint set (Ψ), IP library. Steps: preprocess; simple bus mapping; select an unsatisfied TCP from Ω; explore_params; if the TCP is not met, mutate_topology; once Ω is empty, optimize_design; run floorplanner and delay estimator; if Ω is still empty, output the synthesized communication architecture.]

• CTG, or Communication Throughput Graph, incorporates SoC IPs (nodes) and their interconnections (edges)
• TCP, or Throughput Constraint Path, is a CTG sub-graph representing a constraint to be satisfied; Ω is a superset of all TCPs
• Ψ, or Communication Parameter Constraint Set, is a discrete set of valid values for BA parameters, to ensure realistic output

Pasricha et al. [DAC 2005]


FABSYN Synthesis Flow Illustration

Pasricha et al. [DAC 2005]


Addressing Interconnect Scalability

High-end industrial solutions: evolutionary path from shared busses
• Protocol evolutions: AMBA AHB → AMBA AXI
• Topology evolutions: AMBA AHB → AMBA AHB ML

Challenges
• Complexity (e.g. 4-SHB + 2XBar, 75 actors): how to analyze and verify “spaghetti interconnects”?
• Scalability: bus is bandwidth-limited, Xbar is size-limited
• Predictability: how to tie interconnects with floorplanning


The Network-on-Chip Paradigm

[Figure: a NoC of switches connecting DSP, DRAM, DMA, CPU, Accel and MPEG cores, each through a network interface (NI).]

The “power of NoCs”:
• Clean separation at session layer
  • Cores issue end-to-end transactions
  • Network deals with transport, network, link, physical
• Modularity at HW level: only 2 building blocks
  • Network interface
  • Switch (router)
• Physical design aware (floorplan, global routing)

Scalability is supported from the ground up!


NoC Synthesis Project

• Started in 2002
• UNIBO, UNICA, Stanford, EPFL
• Objective: develop a complete EDA flow for NoC synthesis from application to P&R

[Figure: flow. Inputs: application (input traffic model), comm graph, constraint graph, constraints (area, power, hop delay, wire length), user objectives (power, hop delay), system specs, IP core models, NoC area models, NoC power models, floorplanning specifications. SunFloor performs topology synthesis (includes a floorplanner and NoC router); platform generation (xpipes-Compiler) draws on the NoC component library and produces SystemC code for architectural simulation, co-design and simulation, and RTL for FPGA emulation. The backend flow covers synthesis and placement & routing, to fab. Area and power characterization feeds the models.]


The xpipes NoC

Network interface (NI)
• Packeting/unpacketing
• OCP 2.0 protocol to connect to IP cores
• Source routing
• Dual clock, 2-stage pipeline

[Figure: initiator NI and target NI, each with an OCP side (OCP clk) and a network side (xpipes clk), a LUT, and packeting/unpacketing blocks for request and response packets.]

Switch
• Crossbar, allocator/arbiter, routing & flow control
• Wormhole switching
• Round-robin & fixed priority allocator
• Supports ACK/NACK & STALL/GO flow control
  • ACK/NACK: output buffered, 2 stages
  • STALL/GO: input buffered, 1 stage
• Link pipelining fully supported

A soft macro library:
• xpipes switch (5x5 switch, 32b flit, 4-FIFO, 130nm)
  • 909 MHz
  • 20 FO4 delay
  • 12.7 kgates (NAND2)
  • 0.087 mm2
  • 35.2 uW/MHz (32mW@909MHz)
• xpipes NI: 0.8 switch area
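Source routing means the whole path is decided at the initiator NI, looked up in its LUT and carried in the packet header, so switches need no routing tables. The sketch below illustrates that idea only. The header encoding (3 bits per hop, first hop in the low bits), the names and the route values are assumptions made for this example and are not the xpipes packet format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Assumed encoding: 3 bits per hop, enough to name a port of a 5x5 switch.
constexpr unsigned kBitsPerHop = 3;

// NI side: pack the list of switch output ports into one header word.
inline std::uint32_t encode_route(const std::vector<unsigned>& ports) {
    assert(ports.size() * kBitsPerHop <= 32);
    std::uint32_t route = 0;
    for (std::size_t i = ports.size(); i-- > 0;) {   // last hop first
        assert(ports[i] < (1u << kBitsPerHop));
        route = (route << kBitsPerHop) | ports[i];
    }
    return route;
}

// Switch side: read this hop's output port and shift the route so that the
// next switch finds its own port in the low bits.
inline unsigned next_hop(std::uint32_t& route) {
    unsigned port = route & ((1u << kBitsPerHop) - 1u);
    route >>= kBitsPerHop;
    return port;
}

// Initiator NI: the LUT maps a destination to its precomputed route.
struct InitiatorNI {
    std::map<int, std::uint32_t> lut;
    std::uint32_t header_for(int dest) const { return lut.at(dest); }
};
```

A packet for a hypothetical destination 7, whose LUT entry is `encode_route({2, 4, 1})`, leaves the first switch on port 2, the second on port 4 and the third on port 1.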


Backend Design Flow for Xpipes

[Figure: backend flow with a unified SystemC frontend.
  • Topology design: topology specs
  • Fabric instantiation (xpipesCompiler, xpipes library): topology SystemC
  • Architectural simulation (cycle-accurate simulation platform, traffic generators): architectural statistics, performance figures, traffic logs
  • HDL translation (SystemC to Verilog): topology HDL
  • Verification, power modeling (Mentor ModelSim, Synopsys PrimePower): power figures
  • Fabric synthesis (Synopsys Design Compiler, tech library): topology netlist, area figures
  • Place & route (Cadence SoC Encounter & Synopsys Astro): topology floorplan]


“High-level” Design Flow

[Figure: a. Mesh, b. Torus]

Or, do I want a custom topology?
• Automatically find topology, architecture
• Minimize area, power, latency
• Satisfy design constraints


SUNFLOOR & Xpipes Flow

[Figure: SunFloor and xpipes flow. The high-level design flow is the SunFloor topology synthesis stage, fed by the application (input traffic model), comm graph, constraint graph, constraints (area, power, hop delay, wire length), user objectives (power, hop delay), NoC area and power models, IP core models and floorplanning specifications. It is followed by platform generation (xpipes-Compiler) from the NoC component library, SystemC code for architectural simulation, FPGA emulation, synthesis, placement & routing, to fab.]


Custom Topology & Mapping

Objectives
• Design fully application-specific custom topologies
• Generate deadlock-free networks: both routing and message-level deadlocks are removed
• Optimize architectural parameters of the NoC (frequency, flit size), tuning based upon application requirements

Physical design awareness
• Leverage accurate analytical models for area and power, back-annotated from layouts
• Integrated floorplanner to achieve design closure while also considering wiring complexity


SUNFLOOR Steps

• Vary NoC architectural parameters: frequency, data-width
  • Bandwidth, power consumption varies
• Vary number of switches
  • Which is better? Do not know!


SUNFLOOR Steps

• Synthesize best topology
  • NP-hard problem (single path multi-commodity flow)
  • Use fast & efficient heuristics
• Perform floorplan of design
• Calculate timing characteristics
• If design constraints met, save solution
• Choose most efficient solution satisfying all design constraints


Synthesis Algorithm

• Obtain min-cut partitions of core-graph
  • Cores in a partition share a switch
• Find lower bound on switch sizes, switch power
• Establish switch connectivity by routing flows
  • Account for constraints on # hops, deadlock avoidance, switch size
  • Minimize power

Refer to Murali et al. ICCAD06 for full details

[Figure: core graph; nodes are cores (vld, rld, iqn, isn, ups, idct, arm, ...) and edges carry numeric weights.]


Case Study 1: Comparison Against Hand-Mapped Topology

On our 30-core multimedia benchmark

[Figure: hand-mapped topology (bi-directional links) vs. SUNFLOOR custom topology (bi-directional and uni-directional links), built around processor-memory clusters. P: processors, M: private memories, T: traffic generators, S: shared slaves.]


Case Study 1: Results vs Hand-Mapped

Hand-mapped design:
• Topology: 5x3 mesh (15 switches)
• Operating frequency: 793 MHz (post-layout)
• Power consumption: 368 mW
• Floorplan area: 35.4 mm2
• Design time: weeks
• 0.13 μm technology

SunFloor:
• Topology: custom (8 switches)
• Operating frequency: 793 MHz (post-layout)
• Power consumption: 277 mW (-25%)
• Cell area: 37 mm2 (+4%)
• Design time: 4 hours design to layout
• 0.13 μm technology

Benchmark execution times comply with application requirements and are even 10% better on the SunFloor topology.


Case Study 2: SUNFLOOR Vs Regular Topologies

Application        Topology   Power (mW)   Avg. nr. hops
VPROC (42 cores)   Custom     79.64        1.67
                   Mesh       301.8        2.58
                   Opt-mesh   136.1        2.58
MPEG4 (12 cores)   Custom     27.24        1.50
                   Mesh       96.82        2.17
                   Opt-mesh   60.97        2.17
VOPD (12 cores)    Custom     30.00        1.33
                   Mesh       95.94        2.00
                   Opt-mesh   46.48        2.00
MWD (12 cores)     Custom     20.53        1.15
                   Mesh       90.17        2.00
                   Opt-mesh   38.60        2.00

On average, SunFloor custom topologies:
• 2.75x less power consumption
• 1.55x less hop delay

Despite large design space, maximum run time of 1 hour for VPROC


Looking Forward

• Quality of service guarantees for critical traffic
• Run-time configurability
• Robustness w.r.t. static/dynamic variations, errors
• Network interfaces: interoperability, performance


Summary

• MPSoC design technology is in fast evolution
• Support for functional design is reaching industrial maturity
  • Virtual platforms
  • IP reuse standardization
• Support for analysis of non-functional properties is immature
  • Even functional analysis is only simulation-based
  • System optimization is at research stage
