Defense

NCKU, Low power, high performance VLSI design lab

Design Automation Tool from Behavior Level to Transaction Level for Virtual Bus-Based Platforms

Advisor: Lih-Yih Chiou Student: Hi-Ho Chen

23 June 2008

2


Outline

Motivation and Contributions
Previous Works
Proposed Design Automation Tool from Behavior Level to Transaction Level for Virtual Bus-Based Platforms
  Representation
  Design Flow Overview
  Block Level
    Methodology
    Translation
  Platform Level
    Develop Library for CoWare
    System Control Generator
Experiments
  Scalar 176*144
  DWT 44*36
Conclusions and Future Works
References

3


Introduction

In the SoC era, more and more IPs are integrated onto a single chip

ESL (Electronic System Level) design lets designers rapidly simulate the functional behavior of a system at a higher level of abstraction, before hardware implementation

Communication design has become one of the key criteria in SoC design

4


Top-down Design Flow

1. Product requirements from customer
     Algorithm selection, optimization
2. Specification Model
     Allocation, behavior partitioning, scheduling
3. Architecture Model
     Protocol selection, channel partitioning, arbitration
4. Communication Model
     Cycle scheduling, protocol scheduling
5. Implementation Model

[1] S. S. Pasricha, N. Dutt, and M. Ben-Romdhane, "Using TLM for exploring bus-based SoC communication architectures," 16th IEEE International Conference on Application-Specific Systems, Architecture Processors, 2005, pp. 79-85

5


Arbitration Level vs. Simulation Speed

[2]C. Lennard and D. Mista, "Taking Design to the System Level," 2006 [Online]. Available:(http://www.arm.com/pdfs/ARM_ESL_20_3_JC.pdf)

6


High Level Synthesis

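Behavior Synthesis

Separate the control path and the data path from the behavior description

Control
  if-then-else
  switch-case

Data path
  Data flow

Example behavior description:

    x = a + b;
    c = a < b;
    if (c) {
        d = e - f;
    } else {
        g = h + i;
    }
    j = d * g;
    l = e + x;

[Figure: the example split into a control part, which branches on c, and a data path built from memory, MUXes, and an ALU. The data path executes x = a + b; c = a < b; then d = e - f or g = h + i; then j = d * g; l = e + x]

[3] SPARK Methodology, http://mesl.ucsd.edu/spark/methodology.shtml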
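The split between control and data path can be illustrated in C. This is only a sketch, not SPARK output: the `Regs` struct, the `mux` helper, and the choice to keep `d` and `g` in registers when their branch is not taken are assumptions made for illustration. `behavior` follows the slide's example (with `d = e - f`, as in the figure), while `datapath` evaluates both operators every time and lets the control signal `c` drive the multiplexers.

```c
#include <assert.h>
#include <stdbool.h>

/* Register file for the example; d and g keep their previous value
 * when their branch is not taken (an assumption of this sketch). */
typedef struct { int a, b, e, f, h, i, d, g, x, j, l; bool c; } Regs;

/* Behavior-level description: control flow expressed with if/else. */
static void behavior(Regs *r) {
    r->x = r->a + r->b;
    r->c = r->a < r->b;
    if (r->c) { r->d = r->e - r->f; } else { r->g = r->h + r->i; }
    r->j = r->d * r->g;
    r->l = r->e + r->x;
}

/* Hypothetical 2-to-1 multiplexer. */
static int mux(bool sel, int when_true, int when_false) {
    return sel ? when_true : when_false;
}

/* Data-path form: both operators always compute, and the control
 * signal c only selects which result is stored. */
static void datapath(Regs *r) {
    r->x = r->a + r->b;
    r->c = r->a < r->b;            /* control signal from the comparator */
    int sub = r->e - r->f;
    int add = r->h + r->i;
    r->d = mux(r->c, sub, r->d);   /* update d only when c is true */
    r->g = mux(r->c, r->g, add);   /* update g only when c is false */
    r->j = r->d * r->g;
    r->l = r->e + r->x;
}
```

Both functions produce the same register contents for the same inputs; the second form is closer to what a MUX/ALU data path with a separate controller would implement.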

7


Contributions

Rapid system exploration
  Fast exploration of multiple micro-architecture alternatives

Shorter verification/simulation cycle
  Speed-up from moving the design from behavior level to transaction level

Quick access to power and performance information
  Earlier estimation of design specifications

Increased performance
  Reduced communication and computation

9


Previous Works - SPARK(1)

Input: C

Output: C, VHDL

Advantages: defines a new synthesis tool for parallel designs

Disadvantages:
  No platform architecture
  Communication issues are not addressed

[Figure: SPARK flow in three phases (Phase 1, Phase 2, Phase 3)]

[4] "SPARK: A High-Level Synthesis Framework for Applying Parallelizing Compiler Transformations," Proceedings of the 16th International Conference on VLSI Design, 4-8 Jan. 2003, pp. 461-466

10


Previous Works - xPilot(2)

Input: C/SystemC

Output: Verilog/SystemC

Method
  Phase 1: SSDM
  Phase 2: Synthesis

Advantages:
  Direct mapping to FPGA
  Quick verification

Disadvantages: communication issues are not addressed

[5] "Platform-Based Behavior-Level and System-Level Synthesis," IEEE International SOC Conference, Sept. 2006, pp. 199-202

11


Previous Works - MFASE(3)

MFASE (Multiple Functions SoCs Analysis Environment)

Design flow
  HW/SW partition
  Architecture mapping
  Communication analysis
  …

Advantage
  HW/SW co-design

Limitation
  IP database

[6] "MFASE: Multiple Functions SoCs Analysis Environment," VLSI Design/CAD Symposium, Taiwan, August 2007

12


Summary

Previous works
  Synthesis tools
    SPARK and xPilot synthesize hardware C code into RTL code
    SPARK and xPilot do not consider communication issues
  MFASE does not describe how the design can be generated automatically

Thesis
  Build an automation tool from functional level to transaction level for virtual bus-based platforms
    Addresses both computation and communication issues
    Automates the path from behavior level to transaction level

14


Representation

Example: C to CDFG

Example for "if-then-else"

    if (a == 0) { b = c + d; } else { b = c - d; }

  CDFG: Condition (a == 0)
          True:  If Body (b = c + d), then END
          False: Else Body (b = c - d), then END

Example for "for loop"

    for (i = 0; i < 5; i++) { c = a + b; }

  CDFG: Initial (i = 0), then Condition (i < 5)
          True:  Body (c = a + b), then Update (i = i + 1), then back to Condition
          False: END
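The "for loop" CDFG above can be written down as a small table of nodes and walked like a graph. This is a minimal sketch of one possible in-memory representation; the `Node` layout, the index-based edges, and `run_for_cdfg` are hypothetical and are not the tool's actual data structures.

```c
#include <assert.h>

typedef enum { NODE_INITIAL, NODE_CONDITION, NODE_BODY, NODE_UPDATE, NODE_END } NodeKind;

/* Edges are stored as indices into the node table; -1 means "no edge". */
typedef struct { NodeKind kind; int next_true; int next_false; } Node;

/* CDFG for: for (i = 0; i < 5; i++) { c = a + b; } */
static const Node FOR_LOOP[] = {
    /* 0 */ { NODE_INITIAL,   1, -1 },  /* i = 0      */
    /* 1 */ { NODE_CONDITION, 2,  4 },  /* i < 5      */
    /* 2 */ { NODE_BODY,      3, -1 },  /* c = a + b  */
    /* 3 */ { NODE_UPDATE,    1, -1 },  /* i = i + 1  */
    /* 4 */ { NODE_END,      -1, -1 },
};

typedef struct { int i, a, b, c, body_runs; } Vars;

/* Walks the graph from the Initial node to END and returns the
 * number of nodes visited before END is reached. */
static int run_for_cdfg(Vars *v) {
    int at = 0, visited = 0;
    while (FOR_LOOP[at].kind != NODE_END) {
        const Node *n = &FOR_LOOP[at];
        int next = n->next_true;
        switch (n->kind) {
        case NODE_INITIAL:   v->i = 0; break;
        case NODE_CONDITION: if (!(v->i < 5)) next = n->next_false; break;
        case NODE_BODY:      v->c = v->a + v->b; v->body_runs++; break;
        case NODE_UPDATE:    v->i = v->i + 1; break;
        case NODE_END:       break;
        }
        at = next;
        visited++;
    }
    return visited;
}
```

The Condition node is the only one with two outgoing edges, which is what later becomes the `i < 5` / `i >= 5` guard pair in the state graph.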

16


Design Flow Overview 1/2

[Figure: design flow in three stages]

Block Level
  Spec C to CDFG translation
  Profiling & analysis
  Translation to SystemC (.cpp and .h), repeated until all Spec C files have been translated

Platform Level using Simple Bus
  Link port setting
  Approximated-time simulation

Platform Level using CoWare
  Tcl translation
  Wrapper library, PMU generator, CTL generator
  Connection to the CoWare library
  Simulation on CoWare

17


Design Flow Overview 2/2

Block Level
  Methodology
    Parallel
    Cascade (multi-cycle)
  Translation
    State & edge reduction
    STG to SystemC generator

Platform Level using Simple Bus
  Approximate-time simulation

Platform Level using CoWare
  *.tcl generator
  Peripheral generator

19


Block Level

Input
  Functional-level CDFG
  Block inside configuration
    Max parallel depth
    Buffer size
    Boundary case
  Block-to-bus configuration
    Max burst size
    Initial address
    Address offset

Output
  TLM SystemC

[Figure: block-level flow. The CDFG goes through CDFG analysis (parallel analysis, boundary analysis, irregularity analysis, parallel and internal communication analysis), then block synthesis and interface synthesis, driven by the synthesis configuration, the power library, and the wrapper library. State and edge reduction and performance estimation (approximate time or cycle time) follow, with a Yes/No check on the for-loop condition, implementation selection, and state reduction]

21


Block Level - Methodology 1/10

Computation Reduction

Parallel analysis
  Step 1: Convert C to CDFG format
  Step 2: Unroll the "for loop" to obtain the cycle count
  Step 3: Find a solution that fits the "for loop" condition
    Under the hardware constraint
    GCD methodology
  Step 4: Find the closest solution that satisfies the hardware condition
  Step 5: Update the CDFG

    for (j = 0; j < 2; j++) {
        for (i = 3; i < 7; i++) {
            b[j][i] = (a[j][i] + a[j][i+1]) >> 1;
        }
    }

[Figure: two unrolled schedules of the loop above. Sequential: ForBegin, ForBegin, Body(0,3) to Body(1,6) one per cycle, ForEnd, ForEnd, in cycles 1 to 12. Parallel: the same states with two bodies per cycle, for example Body(0,3) together with Body(0,4), in cycles 1 to 8]
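The cycle counts of the two unrolled schedules (12 cycles sequentially, 8 cycles with two bodies per cycle) can be reproduced with a small model. This is a sketch under stated assumptions: `unrolled_cycles` charges one ForBegin and one ForEnd state per loop level, as the schedules show, and `pick_parallel_depth` is a hypothetical stand-in for the selection step that simply picks the largest depth within the hardware limit that divides the inner trip count. The slides do not spell out the GCD methodology, so this helper should not be read as the tool's actual rule.

```c
#include <assert.h>

/* Cycle count of the unrolled two-level loop: two ForBegin states,
 * the body cycles, and two ForEnd states. */
static int unrolled_cycles(int outer_trips, int inner_trips, int parallel_depth) {
    assert(parallel_depth > 0);
    int bodies = outer_trips * inner_trips;
    int body_cycles = (bodies + parallel_depth - 1) / parallel_depth; /* ceiling division */
    return 2 + body_cycles + 2;
}

/* Hypothetical selection rule: largest depth <= max_depth that divides
 * the inner trip count, so every cycle issues the same number of bodies. */
static int pick_parallel_depth(int inner_trips, int max_depth) {
    for (int d = max_depth; d > 1; d--) {
        if (inner_trips % d == 0) {
            return d;
        }
    }
    return 1;
}
```

For the example loop (2 outer trips, 4 inner trips), depth 1 gives 12 cycles and depth 2 gives 8 cycles, matching the two schedules.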

22


Block Level - Methodology 2/10

Communication factors
  We assume the arrays are located in external memory
  How can the data be fetched from external memory?
    Bus transfer
      Single
      Burst
  Buffer size requirement
  The parallel depth and the size of each data transfer influence performance and power

Address example, assuming a[6][8]
  a[0][3] = address 12    a[1][3] = address 44
  a[0][4] = address 16    a[1][4] = address 48
  a[0][5] = address 20    a[1][5] = address 52

[Figure: Mem 1 is read over the bus in bursts into IBuff, the block computes on A[j][i], and OBuff is written over the bus to Mem 2. A new transfer starts at Addr 12 for row 0 and at Addr 44 for row 1]
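The addresses in the example are consistent with a row-major layout of `a[6][8]` with 4-byte elements starting at address 0. The element size and base address are not stated on the slide; they are inferred from the numbers, so treat `ELEM_BYTES` and the `base` argument below as assumptions of this sketch.

```c
#include <assert.h>

enum { ROWS = 6, COLS = 8, ELEM_BYTES = 4 };  /* ELEM_BYTES is inferred, not given */

/* Byte address of a[j][i] in a row-major array that starts at `base`. */
static unsigned byte_address(unsigned base, unsigned j, unsigned i) {
    assert(j < (unsigned)ROWS && i < (unsigned)COLS);
    return base + (j * (unsigned)COLS + i) * (unsigned)ELEM_BYTES;
}
```

Because consecutive `i` values are 4 bytes apart while consecutive `j` values are 32 bytes apart, a burst can cover one row segment but a new transfer is needed when `j` changes.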

23



Communication Reduction

[Figure: four candidate schedules, Case 1 to Case 4, of the same unrolled loop. Each shows ForBegin, ForBegin, the bodies Body(0,3) to Body(1,6) executed two per cycle, ForEnd, ForEnd, over cycles 1 to 8]

24


Case 1: parallel depth 2, operator 1 cycle

Irregularity: 1
Bus access times: Read 4, Write 4
Max buffer size usage: 3

B(): burst size
T(): transaction number
R: read from bus
W: write to bus

Transaction  Read  Write  Statements
T(1)         B(3)  B(2)   b[0][3] = (a[0][3]+a[0][4])>>1;  b[0][4] = (a[0][4]+a[0][5])>>1;
T(2)         B(3)  B(2)   b[0][5] = (a[0][5]+a[0][6])>>1;  b[0][6] = (a[0][6]+a[0][7])>>1;
T(3)         B(3)  B(2)   b[1][3] = (a[1][3]+a[1][4])>>1;  b[1][4] = (a[1][4]+a[1][5])>>1;
T(4)         B(3)  B(2)   b[1][5] = (a[1][5]+a[1][6])>>1;  b[1][6] = (a[1][6]+a[1][7])>>1;

[Figure: Case 1 schedule over cycles 1 to 8: ForBegin, ForBegin, four cycles with two loop bodies each, ForEnd, ForEnd]

25



Case 2: parallel depth 2, operator 2 cycles

Irregularity: 1
Bus access times: Read 2, Write 2
Max buffer size usage: 5

Transaction  Access    Statements
T(1)         R, B(5)   inputs a[0][3] to a[0][7]
T(2)         W, B(4)   b[0][3] = (a[0][3]+a[0][4])>>1;  b[0][4] = (a[0][4]+a[0][5])>>1;
                       b[0][5] = (a[0][5]+a[0][6])>>1;  b[0][6] = (a[0][6]+a[0][7])>>1;
T(3)         R, B(5)   inputs a[1][3] to a[1][7]
T(4)         W, B(4)   b[1][3] = (a[1][3]+a[1][4])>>1;  b[1][4] = (a[1][4]+a[1][5])>>1;
                       b[1][5] = (a[1][5]+a[1][6])>>1;  b[1][6] = (a[1][6]+a[1][7])>>1;

[Figure: Case 2 schedule over cycles 1 to 8: ForBegin, ForBegin, four cycles with two loop bodies each, ForEnd, ForEnd]

26



Case 3: parallel depth 2, operator 3 cycles

Irregularity: 2
Bus access times: Read 3, Write 3
Max buffer size usage: 8

Bus accesses in order: R B(3), W B(2), R B(3), W B(2), R B(5), W B(4)

Statements:
  b[0][3] = (a[0][3]+a[0][4])>>1;  b[0][4] = (a[0][4]+a[0][5])>>1;
  b[0][5] = (a[0][5]+a[0][6])>>1;  b[0][6] = (a[0][6]+a[0][7])>>1;
  b[1][3] = (a[1][3]+a[1][4])>>1;  b[1][4] = (a[1][4]+a[1][5])>>1;
  b[1][5] = (a[1][5]+a[1][6])>>1;  b[1][6] = (a[1][6]+a[1][7])>>1;

[Figure: Case 3 schedule over cycles 1 to 8: ForBegin, ForBegin, four cycles with two loop bodies each, ForEnd, ForEnd]

27


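Boundary case

    for (j = 0; j < 2; j++) {
        for (i = 3; i < 7; i++) {
            b[j][i] = (a[j][i] + a[j][i+1]) >> 1;
        }
    }

Limitation: the accesses must have a strong address relation

The address relation depends on the memory location

[Figure: the array in memory, indexed by j and i. The address relation holds within a row; at the row boundary a new transfer is needed]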

28



Case 4: parallel depth 2, operator 4 cycles

Irregularity: 1
Bus access times: Read 2, Write 2
Max buffer size usage: 10

Bus accesses in order: R B(5), W B(4), R B(5), W B(4)

Statements:
  b[0][3] = (a[0][3]+a[0][4])>>1;  b[0][4] = (a[0][4]+a[0][5])>>1;
  b[0][5] = (a[0][5]+a[0][6])>>1;  b[0][6] = (a[0][6]+a[0][7])>>1;
  b[1][3] = (a[1][3]+a[1][4])>>1;  b[1][4] = (a[1][4]+a[1][5])>>1;
  b[1][5] = (a[1][5]+a[1][6])>>1;  b[1][6] = (a[1][6]+a[1][7])>>1;

[Figure: Case 4 schedule over cycles 1 to 8: ForBegin, ForBegin, four cycles with two loop bodies each, ForEnd, ForEnd]

29



Which case is better to implement?

Problems
  Case 1: single operator cycle; many bus accesses
  Case 3: the control is too complex; the boundary case is not considered
  Case 4: buffer size

We choose Case 2 for implementation
  It handles the boundary case
  It fits the buffer size constraint
  It needs few bus accesses
  It is regular

                        Case 1  Case 2  Case 3  Case 4
Irregularity            1       1       2       1
Boundary case           O       O       X       O
Max buffer size         3       5       8       10
Read bus access times   4       2       3       2
Write bus access times  4       2       3       2
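The Case 1 and Case 2 columns can be derived from how many consecutive outputs are computed per bus transaction. The sketch below is an illustration only: the `group` parameter and the `bus_cost` function are hypothetical names, and the model covers only regular groupings within one row, so it does not reproduce the irregular Case 3 or the buffer figure of Case 4. It relies on the observation that `b[j][i] = (a[j][i] + a[j][i+1]) >> 1` needs `group + 1` inputs to produce `group` adjacent outputs.

```c
#include <assert.h>

typedef struct {
    int reads;        /* read bus accesses   */
    int writes;       /* write bus accesses  */
    int read_burst;   /* words per read      */
    int write_burst;  /* words per write     */
} BusCost;

/* rows: outer trip count; outputs_per_row: inner trip count;
 * group: adjacent outputs produced from one read burst. */
static BusCost bus_cost(int rows, int outputs_per_row, int group) {
    assert(group > 0 && outputs_per_row % group == 0);
    BusCost c;
    int groups = rows * (outputs_per_row / group);
    c.reads = groups;
    c.writes = groups;
    c.read_burst = group + 1;  /* neighbouring outputs share one input */
    c.write_burst = group;
    return c;
}
```

With 2 rows of 4 outputs, `group = 2` gives 4 reads of burst 3 and 4 writes of burst 2 (Case 1), and `group = 4` gives 2 reads of burst 5 and 2 writes of burst 4 (Case 2). The read burst also matches the maximum buffer usage listed for these two cases.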

30


Under the conditions
  Parallel depth
  Boundary case

Analysis
  Step 1: Trace the states by operator cycles
  Step 2: Separate the read and write parts and find the period
  Step 3: Estimate the cycles and the hardware cost
  Step 4: Find the best solution

O(): operator cycles
B(): buffer size
R(): read counts
W(): write counts
S(): state size
Ir(): irregularity

[Figure: search over case1 to case4 under the boundary condition, cycle 8, parallel depth 2. The cases correspond to O(1), O(2), O(3), O(4) with Ir(1), Ir(1), Ir(2), Ir(1); the annotations include B(5), S(4), W(2), R(2) and B(8), B(5), S(4), W(4), R(4)]

32


Translation 1/3

Example: CDFG to state transition graph (STG)
  Fits the time steps
  Easy to pass to the FSM generator

Example for "if-then-else"
  CDFG: If Then Else Begin (a == 0)
          True:  If Body (b = c + d)
          False: Else Body (b = c - d)
        If Then Else END
  STG:  Cycle 1: evaluate a == 0
        Cycle 2: b = c + d when a == 0, b = c - d when a != 0
        Cycle 3: END

Example for "for loop"
  CDFG: for Begin (i = 0), condition i < 5
          True:  for Body (c = a + b), i = i + 1, back to the condition
          False: for End
  STG:  Cycle 1: i = 0
        Cycle 2: c = a + b, i = i + 1, repeated while i < 5
        Cycle 3: END when i >= 5

33


Translation 2/3

Step 1: CDFG to STG
  Unroll the "for loop" condition

Step 2: Methodology
  Reduce computation
    Parallel
  Reduce communication
    Cascade
  Architecture definition

Step 3: Translate to TLM SystemC
  Header
  Function

[Figure: the for begin, for Body, and for end states of the Case 2 schedule (cycles 1 to 8). Each for Body expands into the bus states RD_REQ, READ, EXE, WT_REQ, and Write, with transitions guarded by Grant / !Grant, Read done / !Read done, and Write done / !Write done]
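The bus states of one loop body can be sketched as a next-state function in C. The state and guard names come from the figure; the ordering (request, read, execute, write request, write), the single-cycle `EXE` state, and the extra `DONE` state are assumptions of this sketch, not details confirmed by the slides.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { RD_REQ, READ, EXE, WT_REQ, WRITE, DONE } BodyState;

/* Inputs sampled each cycle from the bus/wrapper. */
typedef struct { bool grant, read_done, write_done; } BusInputs;

/* Next state of one "for Body"; each state waits until its guard holds. */
static BodyState body_next(BodyState s, BusInputs in) {
    switch (s) {
    case RD_REQ: return in.grant      ? READ  : RD_REQ;  /* !Grant: keep requesting */
    case READ:   return in.read_done  ? EXE   : READ;    /* !Read done: keep reading */
    case EXE:    return WT_REQ;                          /* assumed to take one cycle */
    case WT_REQ: return in.grant      ? WRITE : WT_REQ;
    case WRITE:  return in.write_done ? DONE  : WRITE;   /* !Write done: keep writing */
    case DONE:   return DONE;                            /* hypothetical terminal state */
    }
    return s;
}
```

A generator that emits SystemC would typically turn the same transition table into a clocked process with one `switch` on the state register.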

34


Translation 3/3

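Block Level
  Interface
    Block to wrapper
    Block to block
  Control
    FSM
  Data path
    Operator assignment
  Control signals
    Block to wrapper
    Block to data path
    Block to buffer

[Figure: generated block. CTL controls DataPath(1) and DataPath(2), which sit between the Input Buffer and the Output Buffer. The Block Interface connects the block to the Wrapper, and the Wrapper Interface connects the Wrapper to the Bus]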

36


Platform Level

Input
  Port mapping
  Library location
  CoWare setting

Output
  *.tcl for the CoWare-based platform

Communication generator

System control
  Wrapper
  Mux
  PMU
  Interrupt

[Figure: platform level. The communication generator takes the communication and wire configuration. The peripheral generator comprises the System Control Generator, Wrapper Generator, PMU Generator, Mux Generator, and Interrupt Generator. Together they form the Platform Generator]

38


Develop Library for CoWare 1/3

Master Wrapper Generator
  Based on the CoWare API for AMBA AHB
  Advantages
    Supports any burst type
    Burst lock
  Limitation
    Buffer size

[Figure: master wrapper between the generated blocks and the AHB bus (AHBInitiator_inoutmaster_port). Internals: FSM, InBuffer, OutBuffer, two synchronizers, input handshake, output handshake, with paths from the bus and from the IP. Signals: IP_Dout, IP_OutValid, IP_OuMemType, IP_OuMemReq, IP_OuMemRNW, IP_OuMemCounts, IP_OuMemAccess, IP_OuMemAddr, finish3, WR_InValid, WR_Din, WR_RelReq, Data_Out_Done, Clk, Rst, StartMaster]

39



PMU Generator

Input: configuration
  Block number: default 3
  Idle cycles: default 1000
  Wake-up cycles: default 1000
  Policy: fixed time-out policy

Output: SystemC

[Figure: PMU with FSM, register, signal detect, and LUT. Ports: Start, Clk, Rst, PMU_Out_Ready_0, PMU_Out_Idle_0, PMU_In_PB_CL_0, PMU_In_CL_0, PMU_out_CLkg_0]
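A fixed time-out policy can be sketched as a three-state machine: the block is put to sleep after a fixed number of idle cycles, and waking it costs a fixed number of cycles. The defaults of 1000 cycles come from the configuration above; the state names, the `Pmu` struct, and the exact counting convention are hypothetical and only meant to show the idea, not the generated SystemC.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { PMU_ACTIVE, PMU_SLEEP, PMU_WAKEUP } PmuState;

typedef struct {
    PmuState state;
    unsigned idle_limit;     /* idle cycles before sleeping (default 1000) */
    unsigned wakeup_cycles;  /* cycles needed to wake up (default 1000)    */
    unsigned counter;        /* cycles spent idle or waking                */
} Pmu;

static Pmu pmu_init(void) {
    Pmu p = { PMU_ACTIVE, 1000u, 1000u, 0u };
    return p;
}

/* Advance the PMU by one clock cycle for a single block.
 * block_busy is true when the block has work in this cycle. */
static void pmu_step(Pmu *p, bool block_busy) {
    switch (p->state) {
    case PMU_ACTIVE:
        if (block_busy) {
            p->counter = 0;                       /* activity resets the time-out */
        } else if (++p->counter >= p->idle_limit) {
            p->state = PMU_SLEEP;                 /* time-out expired: gate the block */
            p->counter = 0;
        }
        break;
    case PMU_SLEEP:
        if (block_busy) {                         /* a request arrives: start waking */
            p->state = PMU_WAKEUP;
            p->counter = 0;
        }
        break;
    case PMU_WAKEUP:
        if (++p->counter >= p->wakeup_cycles) {
            p->state = PMU_ACTIVE;
            p->counter = 0;
        }
        break;
    }
}
```

One such state machine per block (three by default) would be enough to produce the active, wake-up, and sleep cycle counts that a power report needs.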

40


Power Calculation

Known parameters
  Total simulation time
  Operation frequency
  Active duration
  Total active number

ACT Energy = ActiveN * ActiveP * (1 / Freq)

Idle Energy = IdleN * IdleP * (1 / Freq)

Total energy = ACT Energy + Idle Energy

Power = (ACT Energy + Idle Energy) / total time

  ActiveN: number of active counts
  ActiveP: active power per unit time
  IdleN: number of idle counts
  IdleP: idle power per unit time

[Figure: timeline with an active interval from 900 ns to 950 ns; Freq = 70 MHz]
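The energy and power relations on this slide can be written as two small functions. This is a sketch that assumes each count lasts one clock period (1 / Freq) and that powers are given in consistent units; the function and parameter names are illustrative only.

```c
#include <assert.h>

/* Energy of `counts` cycles, each lasting 1 / freq_hz seconds,
 * at `power_per_unit` power. */
static double energy(double counts, double power_per_unit, double freq_hz) {
    assert(freq_hz > 0.0);
    return counts * power_per_unit * (1.0 / freq_hz);
}

/* Average power = (active energy + idle energy) / total time. */
static double average_power(double active_n, double active_p,
                            double idle_n, double idle_p,
                            double freq_hz, double total_time) {
    assert(total_time > 0.0);
    double act = energy(active_n, active_p, freq_hz);
    double idle = energy(idle_n, idle_p, freq_hz);
    return (act + idle) / total_time;
}
```

For example, 8 active cycles at 2.0 W and 8 idle cycles at 0.5 W with a 4 Hz clock give 4.0 J + 1.0 J over 4 s, that is 1.25 W on average. These numbers are made up to keep the arithmetic easy to check.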


42


System Control Generator

TOP Control Generator
  Input
    Block scheduling
    Block numbers
    Type setting
      Parallel
      Pipeline
      Single (default)
  Output
    SystemC

[Figure: CTL with an FSM and two synchronizers. Ports: Start, Clk, Rst, CtlDone1, CtlEnable1, ToFinish. CTL enables Block 1, Block 2, and Block 3 through Enable Block 1, Enable Block 2, and Enable Block 3]

44


CoWare - Scalar

Sequence: Foreman, Football (30 frames)

45


Simple Bus Environment - Scalar

[Figure: SystemC 2.1 simple bus platform. The Y, Cb, and Cr blocks, the CPU, Mem 1, and Mem 2 are attached to the Simple Bus with its Arbiter; read transfers and write transfers are marked]

46


CoWare Environment - Scalar

Top platform for the scalar application

[Figure: CoWare platform with CTL, Y, Cb, Cr, PMU, Mux, Interrupt, and Wrapper blocks; execution is annotated as Step 1 to Step 5]

47


Experiments – Scalar

Performance with approximate time and cycle time

Scalar            Y part (cycles)  Cb part (cycles)  Cr part (cycles)
Approximate time  239761           91775             91775
Cycle time        325296           100638            100638

Scalar performance and state size on the cycle-time base

Scalar Y part, parallel constraint 4

                 Max cascade  ST size  Bus access  Computation cycles  Communication cycles  Code lines
Original C code  0            0        0           126720              0                     23
Case 2           4            78       9916        31680               115118                1403
Case 3           11           81       1724        31680               33388                 1668

48


Experiments – Power Monitor

Power library
  Method: search the look-up table
  Block -> Module
    FSM switch
    InBuffer
    OutBuffer
    Register
  Block -> Data path
    Operator

Data path  Size  Active power  Idle power
ADD        8     1.0444 mW     23.4124 nW
SUB        8     808.2718 uW   21.3216 nW
DIV        8     4.0100 mW     67.5333 nW
SHR        8     425.8246 uW   9.9244 nW

Module    Width  Active power  Idle power
FSM       6      0.418 mW      12 nW
Buffer    32     1.7346 mW     66 nW
Register  32     1.7346 mW     66 nW
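The look-up step can be pictured as a search over a small constant table. The sketch below uses the data-path values from the table above, converted to microwatts for the active power; the `PowerEntry` layout, the interpretation of "Size 8" as a bit width, and the function name are assumptions of this example rather than the library's real interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *op;    /* operator name                      */
    int size_bits;     /* operand size (assumed to be bits)  */
    double active_uW;  /* active power in microwatts         */
    double idle_nW;    /* idle power in nanowatts            */
} PowerEntry;

/* Data-path entries copied from the power library table. */
static const PowerEntry DATA_PATH_LUT[] = {
    { "ADD", 8, 1044.4,   23.4124 },
    { "SUB", 8, 808.2718, 21.3216 },
    { "DIV", 8, 4010.0,   67.5333 },
    { "SHR", 8, 425.8246, 9.9244  },
};

/* Returns the matching entry, or NULL when the operator is not in the table. */
static const PowerEntry *lookup_power(const char *op, int size_bits) {
    size_t n = sizeof DATA_PATH_LUT / sizeof DATA_PATH_LUT[0];
    for (size_t k = 0; k < n; k++) {
        if (DATA_PATH_LUT[k].size_bits == size_bits &&
            strcmp(DATA_PATH_LUT[k].op, op) == 0) {
            return &DATA_PATH_LUT[k];
        }
    }
    return NULL;
}
```

Returning `NULL` for an unknown operator makes a missing library entry visible to the caller instead of silently contributing zero power.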

49


Experiments - Scalar

Scalar 176*144 power saving

Case       PMU       Active cycles  Wake-up cycles  Sleep cycles  Power (mW)
Scalar Y   No PMU    526572         X               X             22584673.08
Scalar Y   With PMU  326296         1000            199276        14038522.54
Scalar Cb  No PMU    526572         X               X             11000089.08
Scalar Cb  With PMU  101638         1000            423934        2124065.68
Scalar Cr  No PMU    526572         X               X             11000089.08
Scalar Cr  With PMU  101638         1000            423934        2124065.68

        No PMU (mW)  With PMU (mW)  Power saving rate
Scalar  44584851.24  18286653.9     58.98%
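The power saving rate is the relative reduction of the summed power figures. A minimal check of the arithmetic, with an illustrative function name:

```c
#include <assert.h>

/* Relative saving in percent: (no_pmu - with_pmu) / no_pmu * 100. */
static double power_saving_percent(double no_pmu, double with_pmu) {
    assert(no_pmu > 0.0);
    return (no_pmu - with_pmu) / no_pmu * 100.0;
}
```

Calling `power_saving_percent(44584851.24, 18286653.9)` with the Scalar totals gives roughly 58.98, in line with the reported rate. The totals themselves are the sums of the Y, Cb, and Cr rows.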

50


Experiments - DWT

DWT and IDWT

[Figure: DWT and IDWT]

51


Experiments - DWT

Top platform for the DWT application

[Figure: CoWare platform for the DWT application; execution is annotated as Step 1 to Step 4]

52


Experiments - DWT

Performance with approximate time and cycle time

DWT               Cycles
Approximate time  11088
Cycle time        76262

DWT performance and state size on the cycle-time base

DWT, parallel constraint 1

                 Max cascade  ST size  Bus access  Computation cycles  Communication cycles  Code lines
Original C code  0            0        0           1584                0                     46
Case 1           1            42       9504        1584                74678                 8630

53


Experiments - DWT

DWT 44*36 power saving

Case  PMU       Active cycles  Interrupt cycles  Sleep cycles  Power (mW)
DWT   No PMU    145962         X                 X             4066501.32
DWT   With PMU  76362          1000              68600         2155442.52
IDWT  No PMU    145962         X                 X             4415350.53
IDWT  With PMU  69600          1000              75362         2105550.72

     No PMU (mW)  With PMU (mW)  Power saving rate
DWT  8481851.85   4260993.24     49.765%


55


Conclusions

We developed an automation tool that translates a behavior-level CDFG into TLM-level SystemC for virtual bus-based platform design

We incorporated methods that reduce the number of bus accesses during architecture-level profiling

We developed a library for virtual bus-based platforms

The tool enables fast architecture exploration, which reduces verification time

56


Future Works

Model each module's power with equations so that power management can be more accurate

Add a test platform to the tool so that the corresponding test circuitry can be generated automatically

Extend the hardware library with more hardware architectures so that designers have more design options to choose from

57


References

[1] S. S. Pasricha, N. Dutt, and M. Ben-Romdhane, "Using TLM for exploring bus-based SoC communication architectures," 16th IEEE International Conference on Application-Specific Systems, Architecture Processors, 2005, pp. 79-85

[2] C. Lennard and D. Mista, "Taking Design to the System Level," 2006 [Online]. Available: http://www.arm.com/pdfs/ARM_ESL_20_3_JC.pdf

[3] SPARK Methodology, http://mesl.ucsd.edu/spark/methodology.shtml

[4] S. Gupta, S. Gupta, N. Dutt, R. Gupta, and A. Nicolau, "SPARK: a high-level synthesis framework for applying parallelizing compiler transformations," Proceedings of the 16th International Conference on VLSI Design, 2003, pp. 461-466

[5] J. Cong, Y. Fan, G. Han, W. Jiang, and Z. Zhang, "Platform-Based Behavior-Level and System-Level Synthesis," IEEE International SOC Conference, 2006, pp. 199-202

[6] Ya-Shu Chen, Shih-Chun Chou, Chi-Sheng Shih, and Tei-Wei Kuo, "MFASE: Multiple Functions SoCs Analysis Environment," VLSI Design/CAD Symposium, Taiwan, August 2007

58


Thank you
