MPSoC Platform Design and Simulation for Power
Performance Estimation
Zhengjie Lu
Eindhoven University of Technology, Electrical Engineering Department
Den Dolech 2, Postbus 513, 5600 MB, Eindhoven, The Netherlands [email protected]
Abstract—Wireless sensor nodes (WSN) play an important role
in future intelligent applications, such as remote medical
examination and unattended industry-field monitoring. Since such a
device is usually powered by a battery, the design trade-off between
the performance and the power consumption is critical in achieving
maximum battery life. A configurable platform with the
performance and high-level power simulations is essential to
determine the feasible designs in the early design stage. In this
paper, we present such an MPSoC platform and its dynamic power
model. A work flow to enable the design automation with the help
of CHESS and CHECKER1 is provided. An ECG beat-detection
application is implemented on our platform as a case study. The
simulation results predict that the P2P communication architecture
with the software pipelining is optimal on an ECG MPSoC platform
in both the power-constrained scenario and the time-constrained
scenario.
Index terms—MPSoC platform, dynamic power model, work
flow, low power application, high-level power simulation.
1. INTRODUCTION
Wireless sensors are normally constrained by a limited
power budget, which makes a low-power design essential [1].
A favoured approach is to reuse the dedicated hardware
which is optimized for the low power application and
distributed as the Intellectual Property (IP). A full-system
power simulation is then necessary to determine the most
energy efficient configuration.
Power estimation is widely accepted in industry and it
offers promising accuracy. Unfortunately, its simulation
speed can decrease dramatically as the system becomes
complex (e.g. a triple-core processor). This is not efficient in
the early design stage. In fact, a rough estimate of the power
consumption is sufficient in the early design stage to
compare different configurations. For this reason, a higher-
level power simulation of moderate accuracy is acceptable
as the starting point of the system design.
Instruction-set simulation (ISS) is a kind of high-level
performance simulation, which provides cycle-accurate
profiles of a single processor at an adequately fast simulation
speed. Both the number of active cycles and the energy
consumption of function units are collected during the
simulation. To support the multi-core simulation, ISS's are
embedded within a co-simulation environment. The energy
of a full system is the sum of all ISS active energies plus the
inter-ISS communication energy. However, different ISS's
expose different I/O interfaces to the rest of the system, and a
change in the networked architecture requires reworking the
communication interface.
1 CHESS and CHECKER are both commercial compilers
from TargetCompiler N.V.
The group of ULP-DSP (Ultra-Low Power Digital Signal
Processing) at imec Netherlands [24] is developing the ultra
low power WSN for health care and industrial monitoring. A
target multi-core system employs application-specific
instruction-set processors (ASIPs). The ASIP hardware is
developed with TargetCompiler Designer Tool [2], which is
appropriate for low power applications. The software
running on such an ASIP is compiled by the CHESS
compiler [2]. In addition, an ISS for a dedicated ASIP is
generated by CHECKER [2], which produces profiles of
software execution and hardware usage.
Three aspects can be improved in the work flow of
TargetCompiler. First of all, the hardware/software co-
design for multi-core is inadequately supported. An ASIP
and its accompanying software are designed from a single-
core's point of view. Questions such as how these ASIPs
should be interconnected and which networked architecture
is optimal are not well supported in the current work flow [2].
Secondly, a convenient programming model for multi-core
applications in TargetCompiler workflow is not provided.
Thirdly, the power simulation is not easy due to the lack of
power models.
A flexible multi-processor system-on-chip (MPSoC)
platform with high-level power models is set up in this paper,
improving the work flow of TargetCompiler. It enables the
design space exploration on the performance and the power
estimation.
This paper is organized as follows. Section 2 introduces
the related work and our contribution. An MPSoC platform
and its power model are described in section 3. The power
models are intensively studied in section 4. A detailed
explanation of our proposed work flow is given in section 5.
Section 6 shows a case study of ECG application on our
platform. Experiments are reported in section 7. Conclusions
and future work are given in section 8.
2. RELATED WORK AND CONTRIBUTION
Many MPSoC platforms have already been proposed in
academia. In [3] and [4], SimpleScalar [5] is
embedded into a SystemC-based framework for co-
simulation. A general ISS-wrapper interface is introduced in
[6], which extends its use to not only SimpleScalar but also
other ISS’s. A full-system platform is also proposed in [7] as
an extension of M5 [8]. To speed up the simulation, mixed-
level simulation is introduced in [9]: the intra-core
computation is simulated by ISS on the cycle-accurate level
and the inter-core communication is simulated by OSCI
TLM-2 [10] on the transaction-accurate level. Application-
specific MPSoC platforms are also presented in [11] and
[12]. A composable and predictable MPSoC platform
template for streaming application is proposed in [11]. A
work flow for MPSoC platform automatic synthesis is
developed in [12]. None of them supports power simulation.
To bridge this gap, [13] and [14] integrate Wattch [15] in
their platform to estimate the dynamic power consumption
of SimpleScalar cores. Orion2 [16] combines the core power
model of Wattch with a router power model, aiming at
network-on-chip architectures. Besides, a universal power
simulator McPAT is proposed in [17], which takes the
output of a cycle-by-cycle performance simulator as its input.
Unfortunately, an additional parser is needed to integrate
McPAT with MPSoC platforms. Moreover, a TargetCompiler
ISS cannot be plugged directly into the MPSoC platforms
above.
This paper addresses these challenges. Our work makes
three contributions:
1) A general MPSoC platform with common memory
interfaces to an ISS (e.g. TargetCompiler ISS).
2) A high-level dynamic power model for our proposed
platform, aiming at predicting the dynamic power
trend in the early design stage.
3) Investigating the impacts of the communication
architecture and program coding on the dynamic
power consumption of an ECG application.
3. MPSOC PLATFORM
A parameterized MPSoC platform not only benefits the
module reuse, but also enables the design space exploration
(DSE). As shown in Fig. 1, three sub-systems are included:
(1) the IP core subsystem (e.g. IP core 1), (2) the I/O tile
subsystem (e.g. I/O tile 1), and (3) the OCCN subsystem (e.g.
OCCN bus and P2P). The IP core subsystem is a SystemC-
wrapped ISS (see Fig. 2), while the I/O tile and OCCN
subsystems together implement the inter-core communication.
In the following we explain the architecture aspects in
section 3.1, while the programming model on this platform
is introduced in section 3.2.
3.1. Architecture
In Fig. 1, an IP core is connected to an I/O tile subsystem
through its systemC wrapper interface and as a result it can
access either the program memory (PM) or the data memory
(DM) within the I/O tile. Each I/O tile can exchange data
through bus or peer-to-peer-link (P2P) using network
interfaces (NI). A shared memory module can be derived
from the I/O tile if only the DM, the arbiter and the slave NI
are presented in the I/O tile, as shown in Fig. 1. Each
component in Fig. 1 is explained in the following sections.
[Fig. 1. Architecture overview: each IP core connects through a core-address-mapping module to its I/O tile (PM, DM, arbiter, master NI and slave NI); the I/O tiles communicate over the OCCN bus and OCCN P2P links, and a shared-memory module (DM, arbiter, slave NI) hangs on the bus.]
[Fig. 2. IP core subsystem: the ISS inside a SystemC wrapper, with separate address, data and control buses towards the program memory and the data memory.]
[Fig. 3. Multi-word access through NI: a 32-bit datum (DAT1, DAT2) is split into 16-bit words by the master NI (M_NI) and reassembled by the slave NI (S_NI), using REQ/ACK handshakes between master and slave ports.]
3.1.1. IP core: In our situation, an IP core is represented by a
CHECKER-generated ISS. The ISS loads an executable
which is compiled by the CHESS compiler, and
to perform a cycle-accurate simulation. As we mentioned
above, it needs to be wrapped as a systemC class before
being integrated into the platform. CHECKER can also
perform this job which results in the memory mapped I/O
(MMIO) interface as shown in Fig. 2. Notice that the
addressing bus to the data memory is isolated from the
program memory. The same holds for the data bus and the
control bus. All bus widths are determined by the IP core’s
specifications.
3.1.2. Core address mapping: A core-address-mapping
module provides a de-multiplexer with a single input port
and two identical output ports, bridging the core and I/O tile.
Both input port and output port are configured based on the
system specifications so that they can adapt to different IP
core’s I/O width. A core address mapping decides whether
the IP request goes to the local DM or external DMs. A DM
is local to an IP core when it resides inside the core’s I/O tile.
The remainder of DMs are declared as the external DMs
from this core's point of view. In our case, the MMIO
address space of an IP core is divided into two contiguous
sections. The core address mapping module simply maps the
first section to the local DM, and the rest to the external
DMs. Details about the local/external address will be
discussed in section 4.1.
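As an illustrative sketch, the decision made by the core-address-mapping module reduces to a range check on the MMIO address. The boundary constant and names below are hypothetical, since the paper does not give concrete address values:

```c
#include <stdint.h>

/* Hypothetical boundary: the first LOCAL_DM_SIZE words of the MMIO
 * address space form the first contiguous section (local DM); every
 * address above it belongs to the external DMs. */
#define LOCAL_DM_SIZE 0x1000u

typedef enum { TARGET_LOCAL_DM, TARGET_EXTERNAL_DM } mem_target_t;

/* Decide whether an IP-core request stays in the local I/O tile
 * or is forwarded to an external DM over the network. */
static mem_target_t map_core_address(uint32_t mmio_addr) {
    return (mmio_addr < LOCAL_DM_SIZE) ? TARGET_LOCAL_DM
                                       : TARGET_EXTERNAL_DM;
}
```

The de-multiplexing into two identical output ports then simply follows the returned target.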
3.1.3. Arbiter: Simultaneous requests from different IP
cores to the same memory must be supported on our
platform. A multi-ported memory seems to be an easy
solution, but it can result in the high power consumption as
well as the large silicon area [18]. For this reason, a single-
ported memory is preferred in the practical low power
designs. As a consequence, arbitration is necessary to
sequence the concurrent requests to the single-ported
memory. However, it may increase the memory access time
from the IP core’s point of view, i.e. one additional cycle is
needed for arbitration. It may become even worse when the
external request competes for the local memory at the same
time. Round-robin scheduling is currently implemented
within the arbiter, which guarantees that an IP core can
access its local DM within at most 4 cycles (i.e. 2 cycles for
waiting for the previous request to complete and 2 more
cycles for processing the access) from the IP core's point of
view. No arbitration is needed for accesses to the PM,
because we assume no re-configuration (reloading of the PM)
while running. We also assume the PM does not act as an
instruction cache.
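The round-robin policy can be sketched as a simple grant function. This is a behavioural model for illustration only, not the platform's actual arbiter; the number of requesters is a hypothetical configuration:

```c
#include <stdint.h>

#define NUM_REQUESTERS 2  /* hypothetical: two cores share one DM */

/* One round-robin arbitration step: starting from the requester after
 * the previous winner, grant the first requester whose bit is set in
 * req_mask. Returns the granted index, or -1 if nobody requests. */
static int rr_arbitrate(uint8_t req_mask, int last_grant) {
    for (int i = 1; i <= NUM_REQUESTERS; i++) {
        int cand = (last_grant + i) % NUM_REQUESTERS;
        if (req_mask & (1u << cand))
            return cand;
    }
    return -1;
}
```

With two requesters, a core waits for at most one other access before being granted, which is consistent with the 4-cycle worst case stated above.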
3.1.4. Memory: The DM module provides universal
interfaces to adapt to different accessing modes: word, byte
and multi-word access. The IP core determines what modes
are supported in local memory and what size a word/vector
is. Any access to the DM is completed within 2
cycles from the memory's point of view. The same holds for
accesses to the PM module.
3.1.5. OCCN bus/P2P: The OCCN network library [19],
which is developed in the systemC community, is employed
in our platform to simulate the inter-core communication. It
provides two types of communication on the transaction-
accurate level: bus and P2P. The advantages of the
transaction accurate level simulation are the fast simulation
speed and the high level of abstraction. A read transaction
takes 2 clock cycles (one for request, the other for
acknowledgement), while a write transaction costs only one
cycle. Arbitration is only necessary on the bus, and it does
not cost any clock cycles in the transaction-accurate level
simulation. However, we do add one more clock cycle for
arbitration for the sake of more accurate high-level power
simulation.
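The cycle counts above can be restated as a small helper; this is a restatement of the timing model given in the text, not part of the OCCN library:

```c
/* Cycle cost of one OCCN transaction in this high-level timing model:
 * a read takes 2 cycles (request + acknowledgement), a write takes 1,
 * and traffic over the shared bus pays 1 extra cycle for arbitration. */
typedef enum { LINK_P2P, LINK_BUS } link_t;
typedef enum { XACT_READ, XACT_WRITE } xact_t;

static int transaction_cycles(link_t link, xact_t type) {
    int cycles = (type == XACT_READ) ? 2 : 1;
    if (link == LINK_BUS)
        cycles += 1;  /* arbitration cycle, added for power accuracy */
    return cycles;
}
```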
3.1.6. NI: The basic transmitting unit in the OCCN network
library is called "Protocol Data Unit" (PDU), which consists
of both the control header and the data body.
Communication occurs between a pair of master port and
slave port inside different NI’s. Two types of NI are
available in our architecture: the master NI as the traffic
initiator and the slave NI as the traffic target. A slave NI is
accessed by a master NI using network addresses (see
section 5.1 for details). A master NI has two data
connections to reach its neighbour slave NI on the right, as
shown in Fig. 1: either through a bus or a direct P2P link.
The routing decision is based on where the data is kept, i.e.
in the streaming addresses (mapped to the P2P link) or the
non-streaming ones (mapped to the bus). This strategy will
not only speed up the regular burst-transfer between two
cores, but also provide a flexible communication. Another
function of an NI is to synchronize the transaction-level
simulation in the OCCN network and the cycle-accurate
level simulation in the I/O tiles. Also it supports the data
format transformation between PDU and the signals within
the I/O tiles. Both the master NI and the slave NI must be
capable of handling the communication between I/O tiles
with different data bus widths. In our case, data larger than
the network data width (i.e. 16-bit) are broken into multiple
16-bit words by the master NI and then sent to the slave NI
(shown in Fig. 3) in sequence. The slave NI receives the data
segments and assembles them to the complete data.
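The multi-word transfer of Fig. 3 can be sketched as a split/assemble pair. The low-word-first ordering here is an assumption, as the text only states that the data are broken into 16-bit words:

```c
#include <stdint.h>

/* Master NI side: break a 32-bit datum into two 16-bit network words
 * (low half first; the word order is an assumption). */
static void ni_split32(uint32_t data, uint16_t words[2]) {
    words[0] = (uint16_t)(data & 0xFFFFu);
    words[1] = (uint16_t)(data >> 16);
}

/* Slave NI side: reassemble the two segments into the complete data. */
static uint32_t ni_assemble32(const uint16_t words[2]) {
    return (uint32_t)words[0] | ((uint32_t)words[1] << 16);
}
```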
3.2. Multi-process programming model
The IP-based design can improve not only the IP
hardware but also the dedicated software which is optimized
for the hardware. The software running on different IP cores
is treated as a set of individual processes. A common
problem is how to
enable the communication among different processes. In our
case, the data in a process is divided into two groups: the
private data and the shared data. The communication
between the different processes can only take place with the
shared data. How to address the shared data and how to
synchronize the accesses to the shared data are the topics in
this section.
[Fig. 4. Address hierarchy: the core address splits into the local address (subdivided into private and shared addresses) and the external address (subdivided into stream and non-stream addresses).]
3.2.1. Process communication: The starting point is the
address mapping. Two types of addresses are defined from
the IP core’s point of view as shown in Fig. 4: (1) the local
address and (2) the external address. The first one is mapped
to the IP core’s local DM, while the latter is mapped to the
external DMs. The local address space can be divided further
as: (1) the private address which is only addressable to its
local core, and (2) the shared address which can be accessed
by all cores. Hence, the shared data must physically reside in
the shared address. A core can access the shared data, which
is not located in its local memory, through its external
address. A two-step address translation is employed here.
First of all, the master NI translates the IP core’s external
address into the network address so that the target slave NI
can be found in the network. In the second step, the target
slave NI translates the network address to the shared
memory address and puts it on the address bus in the I/O tile.
By doing this, the shared data can be transferred among
different I/O tiles. As we mentioned before, there exist two
data connections from a master NI to its neighbour slave NI.
In our case, a number of network addresses, which are
available for the master NI, are specified as the stream
addresses. Those data located in these stream addresses will
be exchanged through the P2P connection. Those non-stream
addresses would be mapped to the address space of the bus
communication.
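The two-step translation and the stream/non-stream routing decision can be sketched as follows. The layout constants are hypothetical, for illustration only; the paper gives no concrete address map:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout constants, for illustration only. */
#define EXTERNAL_BASE 0x1000u  /* start of the external address section */
#define STREAM_SIZE   0x0100u  /* network addresses below this bound    */
                               /* are routed over the P2P link          */

/* Step 1 (master NI): translate the core's external address into a
 * network address, so the target slave NI can be found. */
static uint32_t to_network_addr(uint32_t ext_core_addr) {
    return ext_core_addr - EXTERNAL_BASE;
}

/* Routing decision: stream addresses go over P2P, the rest over the bus. */
static bool routed_via_p2p(uint32_t net_addr) {
    return net_addr < STREAM_SIZE;
}

/* Step 2 (slave NI): translate the network address into a shared-memory
 * address and put it on the address bus inside the target I/O tile. */
static uint32_t to_shared_mem_addr(uint32_t net_addr, uint32_t shm_base) {
    return shm_base + net_addr;
}
```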
[Fig. 5. Pseudo code for the P/V synchronization example.]
3.2.2. Process synchronization: Before transferring the
shared data, synchronization between processes is necessary.
Two types of software synchronization are employed on our
platform. The first one is based on the P/V primitives [20].
Two semaphores are defined as:
- Start semaphore: indicates whether a core has been started.
- Busy semaphore: indicates whether a core is ready for
transferring shared data.
Fig. 5 shows the pseudo codes of the P/V synchronization. A
core can only start processing after it is informed by the start
semaphore. After processing is complete, it is blocked until
the busy semaphore is released. Then it will write the
processing results to the destination memory. The second
type employs FIFO-based synchronization [21]. The core
acting as the traffic initiator can only write data to a FIFO
when it is "not full", while the core acting as the traffic
target can only read data from a FIFO when it is "not empty".
This type of synchronization provides the opportunity to
make use of data pipelining. But a significant side-effect
exists: both the read and write pointers have to be loaded
every time before the FIFO can be accessed.
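Both synchronization styles can be sketched in a few lines. The busy-wait P/V primitives and the FIFO layout below are illustrative assumptions, since the pseudo code of Fig. 5 is not reproduced in this transcript:

```c
#include <stdbool.h>

/* Binary P/V primitives on a flag kept in shared memory; the flag is
 * volatile because another core writes it. */
static void P(volatile int *sem) {
    while (*sem == 0) { /* spin until the other core signals */ }
    *sem = 0;           /* take the semaphore */
}

static void V(volatile int *sem) {
    *sem = 1;           /* release the semaphore */
}

/* FIFO-based synchronization: the "not full"/"not empty" tests below
 * read both the read and the write pointer on every access, which is
 * exactly the side-effect noted in the text. */
#define FIFO_CAP 8u  /* hypothetical capacity */

typedef struct {
    volatile unsigned rd, wr;  /* monotonically increasing counters */
    int buf[FIFO_CAP];
} fifo_t;

static bool fifo_full(const fifo_t *f)  { return f->wr - f->rd == FIFO_CAP; }
static bool fifo_empty(const fifo_t *f) { return f->wr == f->rd; }
```

In the P/V scheme, a consumer core would first P() the start semaphore, process its data, then P() the busy semaphore before writing results to the destination memory.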
4. DYNAMIC POWER MODEL
On the architecture level, an electronic system’s dynamic
energy can be defined as the sum of the dynamic energy
consumed by all its components. The power dissipation on
the wires is assumed zero. From this, we can derive the
general expression for the system dynamic power as:
P_sys,dynm = E_sys,dynm / t_app,dynm
where P_sys,dynm and E_sys,dynm are the system's average
dynamic power and the total dynamic energy, respectively.
t_app,dynm is the total time of completing the application (e.g.
duty time). This power model is at high abstraction level and
not necessarily accurate, e.g. the leakage power dissipation
during the idle time is not included. However, it can be still
used to predict the power trends within different
architectures. This is sufficient in the early design stage.
Our architecture consists of the cores, the I/O tiles and the
networks. So its dynamic energy is the sum of the dynamic
energy contribution of all three types:
E_sys,dynm = sum_{i=1..N_core} E_core,dynm(i)
           + sum_{j=1..N_io} E_io,dynm(j)
           + sum_{k=1..N_network} E_network,dynm(k)
in which N_core is the total number of IP cores in the system.
E_core,dynm(i) denotes the total energy of the i-th IP core.
Similar meanings hold for the other terms. It should be
pointed out that N_io might be larger than N_core when a
standalone shared memory is present on the bus. Details
about each term in the equation above will be explained in
the next sections.
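As a sanity check, the two formulas combine into a small helper that sums the per-component energies and divides by the application time. The unit choices (nJ, ms, giving µW) are illustrative assumptions:

```c
#include <stddef.h>

/* E_sys,dynm = sum over cores + sum over I/O tiles + sum over network
 * components, as in the equation above. */
static double sys_dynamic_energy(const double *e_core, size_t n_core,
                                 const double *e_io,   size_t n_io,
                                 const double *e_net,  size_t n_net) {
    double e = 0.0;
    for (size_t i = 0; i < n_core; i++) e += e_core[i];
    for (size_t j = 0; j < n_io;   j++) e += e_io[j];
    for (size_t k = 0; k < n_net;  k++) e += e_net[k];
    return e;
}

/* P_sys,dynm = E_sys,dynm / t_app,dynm (nJ / ms = uW). */
static double avg_dynamic_power(double e_total_nj, double t_app_ms) {
    return e_total_nj / t_app_ms;
}
```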
4.1. IP core dynamic energy model
An IP core's dynamic power model at a high abstraction
level is usually a constant, namely the mean dynamic power
number taken from either a layout back-annotated power
simulation or silicon measurements. In this case, its
dynamic energy is defined as:
E_core,dynm = P_core,dynm * t_core,dynm    (1)
where P_core,dynm is the core's dynamic power number and
t_core,dynm is its active time.
Due to the software synchronization, an IP core can be
active even though it is not processing the data. This is due
to the absence of a DMA in our current system. The core's
active time is divided into three phases: the synchronization
phase, the memory-transfer phase and the computation phase.