MPSoC Platform Design and Simulation for Power
Performance Estimation
Zhengjie Lu
Eindhoven University of Technology, Electrical Engineering Department
Den Dolech 2, Postbus 513, 5600 MB, Eindhoven, The Netherlands [email protected]
Abstract—Wireless sensor nodes (WSN) play an important role
in future intelligent applications, such as remote medical
examination and unattended industry-field monitoring. Since such a
device is usually powered by a battery, the design trade-off between
the performance and the power consumption is critical in achieving
maximum battery life. A configurable platform with the
performance and high-level power simulations is essential to
determine the feasible designs in the early design stage. In this
paper, we present such an MPSoC platform and its dynamic power
model. A work flow to enable the design automation with the help
of CHESS and CHECKER1 is provided. An ECG beat-detection
application is implemented on our platform as a case study. The
simulation results predict that the P2P communication architecture
with the software pipelining is optimal on an ECG MPSoC platform
in both the power-constrained scenario and the time-constrained
scenario.
Index terms—MPSoC platform, dynamic power model, work
flow, low power application, high-level power simulation.
1. INTRODUCTION
Wireless sensors are normally constrained by a limited
power budget, which makes a low-power design essential [1].
A favoured approach is to reuse the dedicated hardware
which is optimized for the low power application and
distributed as the Intellectual Property (IP). A full-system
power simulation is then necessary to determine the most
energy efficient configuration.
Power estimation is widely accepted in industry and it
offers promising accuracy. Unfortunately, its simulation
speed can decrease dramatically as the system becomes
complex (e.g. a triple-core processor). This is not efficient in
the early design stage. In fact, a rough estimate of the power
consumption is sufficient in the early design stage to
compare different configurations. For this reason, a higher-
level power simulation of moderate accuracy is acceptable
as the starting point of the system design.
Instruction-set simulation (ISS) is a kind of high-level
performance simulation, which provides cycle-accurate
profiles of a single processor at an adequately fast simulation
speed. Both the number of active cycles and the energy
consumption of function units are collected during the
simulation. To support the multi-core simulation, ISS's are
embedded within a co-simulation environment. The energy
of a full system is the sum of all ISS active energies plus the
inter-ISS communication energy. However, different ISS's
expose different I/O interfaces to the rest of the system, and a
change in the networked architecture requires reworking the
communication interface.
1 CHESS and CHECKER are both commercial compilers
from TargetCompiler N.V.
The group of ULP-DSP (Ultra-Low Power Digital Signal
Processing) at imec Netherlands [24] is developing the ultra
low power WSN for health care and industrial monitoring. A
target multi-core system employs application-specific
instruction-set processors (ASIPs). The ASIP hardware is
developed with TargetCompiler Designer Tool [2], which is
appropriate for low power applications. The software
running on such an ASIP is compiled by the CHESS
compiler [2]. In addition, an ISS for a dedicated ASIP is
generated by CHECKER [2], which produces profiles of
software execution and hardware usage.
Three aspects can be improved in the work flow of
TargetCompiler. First of all, the hardware/software co-
design for multi-core is inadequately supported. An ASIP
and its accompanying software are designed from a single-
core's point of view. Questions such as how these ASIPs
should be interconnected and which networked architecture
is optimal are not well supported in the current work flow [2].
Secondly, a convenient programming model for multi-core
applications in TargetCompiler workflow is not provided.
Thirdly, the power simulation is not easy due to the lack of
power models.
A flexible multi-processor system-on-chip (MPSoC)
platform with high-level power models is set up in this paper,
improving the work flow of TargetCompiler. It enables the
design space exploration on the performance and the power
estimation.
This paper is organized as follows. Section 2 introduces
the related work and our contribution. An MPSoC platform
and its power model are described in section 3. The power
models are intensively studied in section 4. A detailed
explanation of our proposed work flow is given in section 5.
Section 6 shows a case study of ECG application on our
platform. Experiments are reported in section 7. Conclusions
and future work are given in section 8.
2. RELATED WORK AND CONTRIBUTION
Many MPSoC platforms have already been proposed in
academia. In [3] and [4], SimpleScalar [5] is
embedded into a SystemC-based framework for co-
simulation. A general ISS-wrapper interface is introduced in
[6], which extends its use to not only SimpleScalar but also
other ISS’s. A full-system platform is also proposed in [7] as
an extension of M5 [8]. To speed up the simulation, mixed-
level simulation is introduced in [9]: the intra-core
computation is simulated by ISS on the cycle-accurate level
and the inter-core communication is simulated by OSCI
TLM-2 [10] on the transaction-accurate level. Application-
specific MPSoC platforms are also presented in [11] and
[12]. A composable and predictable MPSoC platform
template for streaming application is proposed in [11]. A
work flow for MPSoC platform automatic synthesis is
developed in [12]. None of them supports power simulation.
To bridge this gap, [13] and [14] integrate Wattch [15] in
their platform to estimate the dynamic power consumption
of SimpleScalar cores. Orion2 [16] combines the core power
model of Wattch with a router power model, aiming at
network-on-chip architectures. Besides, a universal power
simulator McPAT is proposed in [17], which takes the
output of a cycle-by-cycle performance simulator as its input.
Unfortunately, an additional parser is needed to integrate
McPAT with MPSoC platforms. Moreover, a TargetCompiler
ISS cannot be plugged directly into the MPSoC platforms
above.
This paper addresses these challenges. Our work makes
three contributions:
1) A general MPSoC platform with common memory
interfaces to an ISS (e.g. TargetCompiler ISS).
2) A high-level dynamic power model for our proposed
platform, aiming at predicting the dynamic power
trend in the early design stage.
3) Investigating the impacts of the communication
architecture and program coding on the dynamic
power consumption of an ECG application.
3. MPSOC PLATFORM
A parameterized MPSoC platform not only benefits the
module reuse, but also enables the design space exploration
(DSE). As shown in Fig. 1, three sub-systems are included:
(1) the IP core subsystem (e.g. IP core 1), (2) the I/O tile
subsystem (e.g. I/O tile 1), and (3) the OCCN subsystem (e.g.
OCCN bus and P2P). The IP core subsystem is a SystemC-
wrapped ISS (see Fig. 2), while the I/O tile and OCCN
subsystems together implement the inter-core communication.
In the following we explain the architecture aspects in
section 3.1, while the programming model on this platform
is introduced in section 3.2.
3.1. Architecture
In Fig. 1, an IP core is connected to an I/O tile subsystem
through its systemC wrapper interface and as a result it can
access either the program memory (PM) or the data memory
(DM) within the I/O tile. Each I/O tile can exchange data
through bus or peer-to-peer-link (P2P) using network
interfaces (NI). A shared memory module can be derived
from the I/O tile if only the DM, the arbiter and the slave NI
are presented in the I/O tile, as shown in Fig. 1. Each
component in Fig. 1 is explained in the following sections.
[Fig. 1. Architecture overview: each IP core connects through a core-address-mapping module to its I/O tile (PM, DM, arbiter, master NI and slave NI); the I/O tiles communicate over the OCCN bus and OCCN P2P links, and a shared-memory module (DM, arbiter, slave NI) hangs on the bus.]
[Fig. 2. IP core subsystem: the ISS inside a SystemC wrapper, with separate address, data and control buses towards the program memory and the data memory.]
[Fig. 3. Multi-word access through NI: a 32-bit datum (DAT1, DAT2) is split into 16-bit words by the master NI (M_NI) and reassembled by the slave NI (S_NI), using REQ/ACK handshakes between master and slave ports.]
3.1.1. IP core: In our situation, an IP core is represented by a
CHECKER-generated ISS. The ISS loads an executable
which is compiled by the CHESS compiler, and
to perform a cycle-accurate simulation. As we mentioned
above, it needs to be wrapped as a systemC class before
being integrated into the platform. CHECKER can also
perform this job which results in the memory mapped I/O
(MMIO) interface as shown in Fig. 2. Notice that the
addressing bus to the data memory is isolated from the
program memory. The same holds for the data bus and the
control bus. All bus widths are determined by the IP core’s
specifications.
3.1.2. Core address mapping: A core-address-mapping
module provides a de-multiplexer with a single input port
and two identical output ports, bridging the core and I/O tile.
Both input port and output port are configured based on the
system specifications so that they can adapt to different IP
core’s I/O width. A core address mapping decides whether
the IP request goes to the local DM or external DMs. A DM
is local to an IP core when it resides inside the core’s I/O tile.
The remainder of DMs are declared as the external DMs
from this core's point of view. In our case, the MMIO
address space of an IP core is divided into two contiguous
sections. The core address mapping module simply maps the
first section to the local DM, and the rest to the external
DMs. Details about the local/external address will be
discussed in section 4.1.
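As an illustrative sketch, the decision made by the core-address-mapping module reduces to a range check on the MMIO address. The boundary constant and names below are hypothetical, since the paper does not give concrete address values:

```c
#include <stdint.h>

/* Hypothetical boundary: the first LOCAL_DM_SIZE words of the MMIO
 * address space form the first contiguous section (local DM); every
 * address above it belongs to the external DMs. */
#define LOCAL_DM_SIZE 0x1000u

typedef enum { TARGET_LOCAL_DM, TARGET_EXTERNAL_DM } mem_target_t;

/* Decide whether an IP-core request stays in the local I/O tile
 * or is forwarded to an external DM over the network. */
static mem_target_t map_core_address(uint32_t mmio_addr) {
    return (mmio_addr < LOCAL_DM_SIZE) ? TARGET_LOCAL_DM
                                       : TARGET_EXTERNAL_DM;
}
```

The de-multiplexing into two identical output ports then simply follows the returned target.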
3.1.3. Arbiter: Simultaneous requests from different IP
cores to the same memory must be supported on our
platform. A multi-ported memory seems to be an easy
solution, but it can result in the high power consumption as
well as the large silicon area [18]. For this reason, a single-
ported memory is preferred in the practical low power
designs. As a consequence, arbitration is necessary to
sequence the concurrent requests to the single-ported
memory. However, it may increase the memory access time
from the IP core’s point of view, i.e. one additional cycle is
needed for arbitration. It may become even worse when the
external request competes for the local memory at the same
time. Round-robin scheduling is currently implemented
within the arbiter, which guarantees that an IP core can
access its local DM within at most 4 cycles (i.e. 2 cycles for
waiting for the previous request to complete and 2 more
cycles for processing the access) from the IP core's point of
view. No arbitration is needed for accesses to the PM,
because we assume no re-configuration (reloading of the PM)
while running. We also assume the PM does not act as an
instruction cache.
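The round-robin policy can be sketched as a simple grant function. This is a behavioural model for illustration only, not the platform's actual arbiter; the number of requesters is a hypothetical configuration:

```c
#include <stdint.h>

#define NUM_REQUESTERS 2  /* hypothetical: two cores share one DM */

/* One round-robin arbitration step: starting from the requester after
 * the previous winner, grant the first requester whose bit is set in
 * req_mask. Returns the granted index, or -1 if nobody requests. */
static int rr_arbitrate(uint8_t req_mask, int last_grant) {
    for (int i = 1; i <= NUM_REQUESTERS; i++) {
        int cand = (last_grant + i) % NUM_REQUESTERS;
        if (req_mask & (1u << cand))
            return cand;
    }
    return -1;
}
```

With two requesters, a core waits for at most one other access before being granted, which is consistent with the 4-cycle worst case stated above.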
3.1.4. Memory: The DM module provides universal
interfaces to adapt to different accessing modes: word, byte
and multi-word access. The IP core determines what modes
are supported in local memory and what size a word/vector
is. Any access to the DM is completed within 2
cycles from the memory's point of view. The same holds for
accesses to the PM module.
3.1.5. OCCN bus/P2P: The OCCN network library [19],
which is developed in the systemC community, is employed
in our platform to simulate the inter-core communication. It
provides two types of communication on the transaction-
accurate level: bus and P2P. The advantages of the
transaction accurate level simulation are the fast simulation
speed and the high level of abstraction. A read transaction
takes 2 clock cycles (one for request, the other for
acknowledgement), while a write transaction costs only one
cycle. Arbitration is only necessary on the bus, and it does
not cost any clock cycles in the transaction-accurate level
simulation. However, we do add one more clock cycle for
arbitration for the sake of more accurate high-level power
simulation.
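The cycle counts above can be restated as a small helper; this is a restatement of the timing model given in the text, not part of the OCCN library:

```c
/* Cycle cost of one OCCN transaction in this high-level timing model:
 * a read takes 2 cycles (request + acknowledgement), a write takes 1,
 * and traffic over the shared bus pays 1 extra cycle for arbitration. */
typedef enum { LINK_P2P, LINK_BUS } link_t;
typedef enum { XACT_READ, XACT_WRITE } xact_t;

static int transaction_cycles(link_t link, xact_t type) {
    int cycles = (type == XACT_READ) ? 2 : 1;
    if (link == LINK_BUS)
        cycles += 1;  /* arbitration cycle, added for power accuracy */
    return cycles;
}
```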
3.1.6. NI: The basic transmitting unit in the OCCN network
library is called "Protocol Data Unit" (PDU), which consists
of both the control header and the data body.
Communication occurs between a pair of master port and
slave port inside different NI’s. Two types of NI are
available in our architecture: the master NI as the traffic
initiator and the slave NI as the traffic target. A slave NI is
accessed by a master NI using network addresses (see
section 5.1 for details). A master NI has two data
connections to reach its neighbour slave NI on the right, as
shown in Fig. 1: either through a bus or a direct P2P link.
The routing decision is based on where the data is kept, i.e.
in the streaming addresses (mapped to the P2P link) or the
non-streaming ones (mapped to the bus). This strategy will
not only speed up the regular burst-transfer between two
cores, but also provide a flexible communication. Another
function of an NI is to synchronize the transaction-level
simulation in the OCCN network and the cycle-accurate
level simulation in the I/O tiles. Also it supports the data
format transformation between PDU and the signals within
the I/O tiles. Both the master NI and the slave NI must be
capable of handling the communication between I/O tiles
with different data bus widths. In our case, data larger than
the network data width (i.e. 16-bit) are broken into multiple
16-bit words by the master NI and then sent to the slave NI
(shown in Fig. 3) in sequence. The slave NI receives the data
segments and assembles them to the complete data.
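The multi-word transfer of Fig. 3 can be sketched as a split/assemble pair. The low-word-first ordering here is an assumption, as the text only states that the data are broken into 16-bit words:

```c
#include <stdint.h>

/* Master NI side: break a 32-bit datum into two 16-bit network words
 * (low half first; the word order is an assumption). */
static void ni_split32(uint32_t data, uint16_t words[2]) {
    words[0] = (uint16_t)(data & 0xFFFFu);
    words[1] = (uint16_t)(data >> 16);
}

/* Slave NI side: reassemble the two segments into the complete data. */
static uint32_t ni_assemble32(const uint16_t words[2]) {
    return (uint32_t)words[0] | ((uint32_t)words[1] << 16);
}
```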
3.2. Multi-process programming model
The IP-based design can improve not only the IP
hardware but also the dedicated software which is optimized
for the hardware. The software running on different IP cores
is treated as a set of individual processes. A common
problem is how to
enable the communication among different processes. In our
case, the data in a process is divided into two groups: the
private data and the shared data. The communication
between the different processes can only take place with the
shared data. How to address the shared data and how to
synchronize the accesses to the shared data are the topics in
this section.
[Fig. 4. Address hierarchy: the core address splits into the local address (subdivided into private and shared addresses) and the external address (subdivided into stream and non-stream addresses).]
3.2.1. Process communication: The starting point is the
address mapping. Two types of addresses are defined from
the IP core’s point of view as shown in Fig. 4: (1) the local
address and (2) the external address. The first one is mapped
to the IP core’s local DM, while the latter is mapped to the
external DMs. The local address space can be divided further
as: (1) the private address which is only addressable to its
local core, and (2) the shared address which can be accessed
by all cores. Hence, the shared data must physically reside in
the shared address. A core can access the shared data, which
is not located in its local memory, through its external
address. A two-step address translation is employed here.
First of all, the master NI translates the IP core’s external
address into the network address so that the target slave NI
can be found in the network. In the second step, the target
slave NI translates the network address to the shared
memory address and puts it on the address bus in the I/O tile.
By doing this, the shared data can be transferred among
different I/O tiles. As we mentioned before, there exist two
data connections from a master NI to its neighbour slave NI.
In our case, a number of network addresses, which are
available for the master NI, are specified as the stream
addresses. Those data located in these stream addresses will
be exchanged through the P2P connection. Those non-stream
addresses would be mapped to the address space of the bus
communication.
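The two-step translation and the stream/non-stream routing decision can be sketched as follows. The layout constants are hypothetical, for illustration only; the paper gives no concrete address map:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout constants, for illustration only. */
#define EXTERNAL_BASE 0x1000u  /* start of the external address section */
#define STREAM_SIZE   0x0100u  /* network addresses below this bound    */
                               /* are routed over the P2P link          */

/* Step 1 (master NI): translate the core's external address into a
 * network address, so the target slave NI can be found. */
static uint32_t to_network_addr(uint32_t ext_core_addr) {
    return ext_core_addr - EXTERNAL_BASE;
}

/* Routing decision: stream addresses go over P2P, the rest over the bus. */
static bool routed_via_p2p(uint32_t net_addr) {
    return net_addr < STREAM_SIZE;
}

/* Step 2 (slave NI): translate the network address into a shared-memory
 * address and put it on the address bus inside the target I/O tile. */
static uint32_t to_shared_mem_addr(uint32_t net_addr, uint32_t shm_base) {
    return shm_base + net_addr;
}
```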
[Fig. 5. Pseudo code for the P/V synchronization example.]
3.2.2. Process synchronization: Before transferring the
shared data, synchronization between processes is necessary.
Two types of software synchronization are employed on our
platform. The first one is based on the P/V primitives [20].
Two semaphores are defined as:
- Start semaphore: indicates whether a core has been started.
- Busy semaphore: indicates whether a core is ready for
transferring shared data.
Fig. 5 shows the pseudo codes of the P/V synchronization. A
core can only start processing after it is informed by the start
semaphore. After processing is complete, it is blocked until
the busy semaphore is released. Then it will write the
processing results to the destination memory. The second
type employs FIFO-based synchronization [21]. The core
acting as the traffic initiator can only write data to a FIFO
when it is "not full", while the core acting as the traffic
target can only read data from a FIFO when it is "not empty".
This type of synchronization provides the opportunity to
make use of data pipelining. But a significant side-effect
exists: both the read and write pointers have to be loaded
every time before the FIFO can be accessed.
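Both synchronization styles can be sketched in a few lines. The busy-wait P/V primitives and the FIFO layout below are illustrative assumptions, since the pseudo code of Fig. 5 is not reproduced in this transcript:

```c
#include <stdbool.h>

/* Binary P/V primitives on a flag kept in shared memory; the flag is
 * volatile because another core writes it. */
static void P(volatile int *sem) {
    while (*sem == 0) { /* spin until the other core signals */ }
    *sem = 0;           /* take the semaphore */
}

static void V(volatile int *sem) {
    *sem = 1;           /* release the semaphore */
}

/* FIFO-based synchronization: the "not full"/"not empty" tests below
 * read both the read and the write pointer on every access, which is
 * exactly the side-effect noted in the text. */
#define FIFO_CAP 8u  /* hypothetical capacity */

typedef struct {
    volatile unsigned rd, wr;  /* monotonically increasing counters */
    int buf[FIFO_CAP];
} fifo_t;

static bool fifo_full(const fifo_t *f)  { return f->wr - f->rd == FIFO_CAP; }
static bool fifo_empty(const fifo_t *f) { return f->wr == f->rd; }
```

In the P/V scheme, a consumer core would first P() the start semaphore, process its data, then P() the busy semaphore before writing results to the destination memory.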
4. DYNAMIC POWER MODEL
On the architecture level, an electronic system’s dynamic
energy can be defined as the sum of the dynamic energy
consumed by all its components. The power dissipation on
the wires is assumed zero. From this, we can derive the
general expression for the system dynamic power as:
P_sys,dynm = E_sys,dynm / t_app,dynm
where P_sys,dynm and E_sys,dynm are the system's average
dynamic power and the total dynamic energy, respectively.
t_app,dynm is the total time of completing the application (e.g.
duty time). This power model is at high abstraction level and
not necessarily accurate, e.g. the leakage power dissipation
during the idle time is not included. However, it can be still
used to predict the power trends within different
architectures. This is sufficient in the early design stage.
Our architecture consists of the cores, the I/O tiles and the
networks. So its dynamic energy is the sum of the dynamic
energy contribution of all three types:
E_sys,dynm = sum_{i=1..N_core} E_core,dynm(i)
           + sum_{j=1..N_io} E_io,dynm(j)
           + sum_{k=1..N_network} E_network,dynm(k)
in which N_core is the total number of IP cores in the system.
E_core,dynm(i) denotes the total energy of the i-th IP core.
Similar meanings hold for the other terms. It should be
pointed out that N_io might be larger than N_core when a
standalone shared memory is present on the bus. Details
about each term in the equation above will be explained in
the next sections.
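As a sanity check, the two formulas combine into a small helper that sums the per-component energies and divides by the application time. The unit choices (nJ, ms, giving µW) are illustrative assumptions:

```c
#include <stddef.h>

/* E_sys,dynm = sum over cores + sum over I/O tiles + sum over network
 * components, as in the equation above. */
static double sys_dynamic_energy(const double *e_core, size_t n_core,
                                 const double *e_io,   size_t n_io,
                                 const double *e_net,  size_t n_net) {
    double e = 0.0;
    for (size_t i = 0; i < n_core; i++) e += e_core[i];
    for (size_t j = 0; j < n_io;   j++) e += e_io[j];
    for (size_t k = 0; k < n_net;  k++) e += e_net[k];
    return e;
}

/* P_sys,dynm = E_sys,dynm / t_app,dynm (nJ / ms = uW). */
static double avg_dynamic_power(double e_total_nj, double t_app_ms) {
    return e_total_nj / t_app_ms;
}
```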
4.1. IP core dynamic energy model
An IP core's dynamic power model at a high abstraction
level is usually a constant, namely the mean dynamic power
number taken from either a layout back-annotated power
simulation or silicon measurements. In this case, its
dynamic energy is defined as:
E_core,dynm = P_core,dynm * t_core,dynm    (1)
where P_core,dynm is the core's dynamic power number and
t_core,dynm is its active time.
Due to the software synchronization, an IP core can be
active even though it is not processing the data. This is due
to the absence of a DMA in our current system. The core's
active time is divided into three phases: the synchronization
phase, the memory-transfer phase and the computation phase.