On-Chip Memory Architecture Exploration of
Embedded System on Chip
A Thesis
Submitted for the Degree of
Doctor of Philosophy
in the Faculty of Engineering
by
T.S. Rajesh Kumar
Supercomputer Education and Research Centre
Indian Institute of Science
Bangalore – 560 012
September 2008
To my Family, Sree, Amma, Advika and Adarsh
Abstract
Today’s feature-rich multimedia products require embedded system solutions with complex System-on-Chips (SoCs) to meet market expectations of high performance at low cost and low energy consumption. SoCs are complex designs with multiple embedded processors,
memory subsystems, and application specific peripherals. The memory architecture of
embedded SoCs strongly influences the area, power and performance of the entire system.
Further, the memory subsystem constitutes a major part (typically up to 70%) of the
silicon area of a current-day SoC.
The on-chip memory organization of embedded processors varies widely from one
SoC to another, depending on the application and market segment for which the SoC is
deployed. There is a wide variety of choices available for the embedded designers, starting
from simple on-chip SPRAM based architecture to more complex cache-SPRAM based
hybrid architecture. The performance of a memory architecture also depends on how
the data variables of the application are placed in the memory. There are multiple data
layouts for each memory architecture that are efficient from a power and performance
viewpoint. Further, the designer would be interested in multiple optimal design points
to address various market segments. Hence, memory architecture exploration for an embedded system involves evaluating a large design space, on the order of 100,000 design points, with each design point having several tens of thousands of possible data layouts.
Due to its large impact on system performance parameters, the memory architecture is
often hand-crafted by experienced designers exploring a very small subset of this design
space. The vast memory design space precludes any possibility of a complete manual analysis.
In this work, we propose an automated framework for on-chip memory architecture
exploration. Our proposed framework integrates memory architecture exploration and
data layout to search the design space efficiently. While the memory exploration selects
specific memory architectures, the data layout efficiently maps the given application onto the memory architecture under consideration and thus helps in evaluating the memory
architecture. The proposed memory exploration framework works at both logical and
physical memory architecture levels. Our work addresses on-chip memory architectures for DSP processors that are organized as multiple memory banks, where each bank can be a single- or dual-port bank and the bank sizes can be non-uniform. Further, our work also addresses memory architecture exploration for on-chip memory architectures that combine SPRAM and cache. Our proposed method is based on a multi-objective Genetic Algorithm and outputs several hundred Pareto-optimal design solutions that are interesting from area, power and performance viewpoints, within a few hours of running on a standard desktop configuration.
Acknowledgments
There are many people I would like to thank who have helped me in various ways.
First and foremost I would like to thank my Supervisors, Prof. R. Govindarajan and
Dr. C.P. Ravikumar, who have guided me and supported me in various aspects through the
entire journey toward completing my thesis work. I profusely thank them for the encouragement
they provided and their perseverance in keeping me focused on the Ph.D. work.
I would like to express my gratitude to Texas Instruments for giving me the time
and opportunity to pursue my studies. I would like to thank my colleagues at Texas
Instruments for their support and reviews, in particular my manager Balaji Holur.
I would also like to thank my previous managers Pamela Kumar and Manohar Sambandam.
Last but not the least, I would like to thank my dearest family members for the
encouragement they provided and the sacrifices they made to help me achieve my goals.
List of Publications from this Thesis

1. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. On-chip Memory Architecture
Exploration Framework for DSP Processor Based Embedded SoC. Submitted to the ACM
Transactions on Embedded Computing Systems, May 2008.
2. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. Memory Architecture Explo-
ration Framework for Cache-based Embedded SoC. In Proceedings of the International
Conference on VLSI Design, Jan 2008.
3. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. MODLEX: A Multi-Objective
Data Layout EXploration Framework for Embedded SoC. In Proceedings of the 12th Asia
and South Pacific Design Automation Conference (ASP-DAC), Jan 2007.
4. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. MAX: A Multi-Objective
Memory Architecture Exploration Framework for Embedded SoC. In Proceedings of the
International Conference on VLSI Design, Jan 2007.
5. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. Embedded Tutorial on Multi-
Processor Architectures for Embedded SoC. In Proceedings of the VLSI Design and Test,
Aug 2003.
6. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. Optimal Code and Data Lay-
out for Embedded Systems. In Proceedings of the International Conference on VLSI
Design, Jan 2003.
7. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. Memory Exploration for Em-
bedded Systems. In Proceedings of the VLSI Design and Test, Aug 2002.
Chapter 1
Introduction
1.1 Application Specific Systems
Today’s VLSI technology allows us to integrate tens of processor cores on the same chip
along with embedded memories, application specific circuits, and interconnect infrastruc-
ture. As a result, it is possible to integrate an entire system onto a single chip. The single
chip phone, which has been introduced by several semiconductor vendors, is an example of
such a system-on-chip; it includes the modem, radio transceiver, power management func-
tionality, a multimedia engine and security features, all on the same chip. An embedded
system is an application-specific system which is optimized to perform a single function
or a small set of functions [70]. We distinguish this from a general-purpose system, which
is software-programmable to perform multiple functions. A personal computer is an ex-
ample of a general-purpose system; depending on the software we run on the computer,
it can be useful for playing games, word processing, database operations, scientific com-
putation, etc. On the other hand, a digital camera is an example of an embedded system,
which can perform a limited set of functions such as taking pictures, organizing them, or
transferring them to another device through a suitable I/O interface. Other examples of
embedded systems include mobile phones, audio/video players, videogame consoles, set-
top boxes, car infotainment systems, personal digital assistants, telephone central-office
switches, dedicated network routers and bridges. Note that a large number of embedded
systems are built for the consumer market. As a result, in order to be competitive, the
cost of an embedded system cannot be very high. Yet, the consumers demand higher per-
formance and more features from the embedded systems products. It is easy to appreciate
this point if we compare the performance and feature set offered by mobile phones that cost Rs. 5000 (about $100) today with those of phones that cost the same a few years ago. We also see that
a large number of embedded systems are being built for the mobile market. This trend
is not surprising - the number of mobile phone subscribers increased from 500 Million in
year 2000 to 2.6 Billion in 2007 [7]. Because of such high volumes, embedded systems are
extremely cost sensitive and their design demands careful silicon-area optimization. Since
mobile devices use batteries as the main source of power, embedded systems must also be
optimized for energy dissipation. Power, which represents the rate at which energy is con-
sumed, must also be kept low to avoid heating and to improve reliability. In summary, the
designer of an embedded system must simultaneously consider and optimize price, perfor-
mance, energy, and power dissipation. Application specific embedded systems designed
today demand innovative methods to optimize these system cost functions [11, 19].
Many of today’s embedded systems are based on system-on-chip platforms [16], which,
in turn, consist of one or more embedded microcontrollers, digital signal processors (DSP),
application specific circuits and read-only memory, all integrated into a single package.
These blocks are available from vendors of intellectual property (IP) as hard cores or soft
cores [42, 28]. A hard core, or hard IP block, is one where the circuit is available at a
lower level of abstraction such as the layout-level [42, 28]; it is impossible to customize a
hard IP to suit the requirements of the embedded system. As a result, there are limited
opportunities in optimizing the cost functions by modifying the hard IP. For example, if
some functionality included in the IP is not required in the present application, we cannot
remove the function to save area. Soft IP refers to circuits which are available at a higher
level of abstraction, such as register-transfer level [28, 42]. It is possible to customize the
soft IP for the specific application. The designer of an embedded SoC integrates the IP
cores for processors, memories, and application-specific hardware to create the SoC.
Figure 1.1 illustrates the architecture of an embedded system-on-chip (SoC). As can
be seen in the figure, there are four principal components in such an SoC.
1. An Analog Front End, which includes the analog/digital and digital/analog converters.
2. Programmable Components which include microprocessors, microcontrollers, and
DSPs. The number of embedded processors is increasing every year. An interesting
statistic shows that of the nine billion processors manufactured in 2005, less than 2%
were used for general-purpose computers. The other 8.8 billion went into embedded
systems [13]. The microcontroller/microprocessor is useful in handling interrupts,
house-keeping and performing timing related functions. The DSP is useful for pro-
cessing the audio and video information e.g., compression and decompression of
audio and video information. The application software is normally preloaded in
the memory and is not user programmable, unlike general-purpose processor-based systems.
3. Application-specific components – these include hardware accelerators for compute-intensive functions. Examples of hardware accelerators include digital image processors which are useful in cameras.

4. Memory subsystem – the on-chip and off-chip memories that hold the application code and data; this component is discussed in detail in the next section.
1.2 Memory Subsystem
1.2.1 On-chip Memory Organization
The memory architecture of an embedded processor core is complex and is custom de-
signed to improve run-time performance and power consumption. In this section we describe only the memory architecture of the DSP processor, as this is the focus of the thesis. The memory architecture of a DSP is more complex than
that of microcontrollers (MCU) due to the following reasons: (a) DSP applications are
more data dominated than the control-dominated software executed on an MCU. Mem-
ory bandwidth requirements for DSP applications range from 2 to 3 memory accesses per
Figure 1.1: Architecture of an Embedded SoC
processor clock cycle. For an MCU, this figure is, at best, one memory access per cycle.
(b) It is critical in DSP application to extract maximum performance from the memory
subsystem in order to meet the real-time constraints of the embedded application. As a
consequence, the DSP software for critical kernels is developed mostly as hand-optimized
assembly code. In contrast, the software for MCU is typically developed in high-level
languages. The memory architecture for a DSP is unique since the DSP has multiple on-
chip buses and multiple address generation units to service higher bandwidth needs. The
on-chip memory of embedded processors can include (a) only Level-1 cache (L1-cache)
(e.g., [1]), (b) only scratch-pad RAM (SPRAM) (e.g., [75, 76]), or (c) a combination of
L1-cache and SPRAM (e.g., [2, 77]).
1.2.2 Cache-based Memory Organization
Purely cache-based on-chip memory organization is generally not preferred by embedded
system designers as this organization cannot guarantee the worst-case execution time
constraints. This is because the access time in a cache based system can vary depending
on whether the access results in a cache miss or a hit [33]. As a consequence, the run-time
performance of cache-based memory subsystems varies based on the execution path of the application and is data dependent. However, a cache architecture is advantageous in the sense that it reduces the programmer’s responsibility for placing data to achieve better memory access times. Further, the movement of data from off-chip memory to cache is transparent to the programmer. In [12], the authors present a comparison study of SPRAM and cache for
embedded applications and conclude that SPRAM has 34% smaller area and 40% lower
power consumption than a cache of the same capacity. There is published literature to
estimate the worst case execution time [81] and find an upper bound on run-time [78]
for cache-based embedded systems. Hence, it has been argued that for real-time embedded systems which require stringent worst-case performance guarantees, a purely cache-based on-chip organization is not suitable.
1.2.3 Scratch Pad Memory-based Organization
On-chip memory organization based only on Scratch Pad memory ensures single cycle
access times and guarantees on worst-case execution for data that resides in Scratch-Pad
RAM (SPRAM). However, it is the responsibility of the programmer to identify the data sections that should be placed in SPRAM, or to place code in the program to appropriately move data from off-chip memory to SPRAM. A DSP core can include the following types of memories: static RAM (SRAM), ROM, and/or dynamic RAM (DRAM). The scratch pad
memory in the DSP core is organized into multiple memory banks to facilitate multiple
simultaneous data accesses. A memory bank can be organized as a single-access RAM
(SARAM) or a dual-access RAM (DARAM) to provide single or dual access to the memory
bank in a single cycle. Also, the on-chip memory banks can be of different sizes; smaller memory banks consume less power per access than larger ones. The embedded
system may also be interfaced to off-chip memory, which can include SRAM and DRAM.
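As a concrete (and purely illustrative) view of the two bank types, the sketch below models the cycles needed to service a burst of accesses to a single bank. The one-access-per-cycle versus two-accesses-per-cycle behavior is the defining property of SARAM and DARAM; the function and names are our own, not taken from any specific DSP:

```python
import math

# Illustrative model (assumed figures, not from a specific DSP):
# a SARAM bank services 1 access per cycle, a DARAM bank services 2.
ACCESSES_PER_CYCLE = {"SARAM": 1, "DARAM": 2}

def bank_cycles(bank_type: str, n_accesses: int) -> int:
    """Cycles needed to service n_accesses directed at one bank."""
    return math.ceil(n_accesses / ACCESSES_PER_CYCLE[bank_type])

# Two accesses issued in the same cycle: a DARAM bank absorbs both,
# while a SARAM bank needs an extra cycle for the second access.
print(bank_cycles("SARAM", 2))  # 2
print(bank_cycles("DARAM", 2))  # 1
```

This is why placing two data items that are accessed together in the same single-access bank costs stall cycles, a point that recurs in the data layout discussion below.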
Purely SPRAM based on-chip organization is suitable only for embedded applications of low to medium complexity. SPRAM based systems do not use the on-chip RAM efficiently, as they require the entire data sections that are currently accessed to be placed exclusively
in the SPRAM. It is possible to accommodate different data sections in SPRAM at dif-
ferent points in execution time by moving data dynamically between off-chip memory
and SPRAM. But this results in a certain run-time overhead and an increase in code size. For medium to large applications, which have a large number of critical data variables, a large amount of on-chip RAM becomes necessary to meet the real-time performance constraints. Hence, for such applications, pure SPRAM architectures are not preferred.
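The overlay idea mentioned above reduces to a simple lifetime check: two data sections may share the same SPRAM address range only if their live ranges never overlap. A minimal sketch, with entirely hypothetical section names and lifetimes:

```python
# Hypothetical data sections with (start, end) live ranges in abstract
# "time units" (e.g., phases of the application). Two sections can be
# overlaid in the same SPRAM region only if their live ranges are disjoint.
lifetimes = {
    "fir_coeffs":   (0, 40),   # live only during the filtering phase
    "fft_twiddles": (50, 90),  # live only during the transform phase
    "io_buffer":    (0, 90),   # live throughout
}

def can_overlay(a: str, b: str) -> bool:
    (s1, e1), (s2, e2) = lifetimes[a], lifetimes[b]
    return e1 < s2 or e2 < s1   # disjoint live ranges

print(can_overlay("fir_coeffs", "fft_twiddles"))  # True: may share SPRAM space
print(can_overlay("fir_coeffs", "io_buffer"))     # False: lifetimes overlap
```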
1.3 Data Layout
To efficiently use the on-chip memory, critical data variables of the application need to be
identified and mapped to the on-chip RAM. The memory architecture may contain both
on-chip cache and SPRAM. In such a case it is important to partition the data sections and
assign them appropriately to on-chip cache and SPRAM such that memory performance
of the application is optimized. Further, among the data sections assigned to on-chip
cache and SPRAM, a proper placement of the data sections on the cache and SPRAM
is required to ensure that the cache misses are reduced and the multiple memory banks
of the SPRAM and the dual ported SPRAMs are efficiently utilized. Identifying such a
data placement for data sections, referred to as the data layout problem, is a complex and critical step [10, 53]. This task is typically performed manually, as the compiler cannot
assume that the code under compilation represents the entire system [10].
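As a first intuition for the problem (this is not the formulation used in this thesis, which is developed in later chapters), a naive layout heuristic might simply place the most frequently accessed sections on-chip until the SPRAM is full. All names, sizes and access counts below are invented for illustration:

```python
# Naive illustration of data layout: greedily place the data sections with
# the highest access density (accesses per byte) into on-chip SPRAM until
# its capacity is exhausted. This ignores bank conflicts and overlays,
# which the real data layout problem must also handle.
SPRAM_CAPACITY = 16 * 1024  # bytes (assumed)

# (section name, size in bytes, accesses per frame) - hypothetical profile
sections = [
    ("speech_frame", 4096, 90000),
    ("fft_buffer",   8192, 60000),
    ("history",      8192, 20000),
    ("tables",       2048,  6000),
]

def greedy_layout(sections, capacity):
    onchip, offchip, used = [], [], 0
    for name, size, accesses in sorted(sections,
                                       key=lambda s: s[2] / s[1], reverse=True):
        if used + size <= capacity:
            onchip.append(name)
            used += size
        else:
            offchip.append(name)
    return onchip, offchip

onchip, offchip = greedy_layout(sections, SPRAM_CAPACITY)
print(onchip)   # hottest sections that fit on-chip
print(offchip)  # the rest spill to off-chip memory
```

A greedy pass of this kind captures only the on-chip/off-chip partitioning aspect; the conflict and overlay aspects make the real problem substantially harder.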
The application program in a modern embedded system is complex since it must
support a variety of device interfaces such as networking interfaces, credit card readers,
USB interfaces, parallel ports, and so on. The application also has many multimedia
components like MP3, AAC and MIDI [8]. This necessitates an IP reuse methodology
[74], where software modules developed and optimized independently by different vendors
are integrated. Figure 1.2 explains the typical flow in embedded application develop-
ment. This integration is a very challenging job with multiple objectives: (a) it has to
be done under tight time-to-market constraints, (b) it has to be repeated
for different variants of SoCs with different custom memory architectures, and (c) it has
to perform in such a way that the embedded application is optimized for performance,
power consumption and cost.
Figure 1.2: Embedded Application Development Flow
Since the IPs/modules are independently optimized, the integrator is under pressure
to deliver the complete product with the expectation that each component performs at the
same level as it did in isolation. This is a major challenge. When a module is optimized
independently, the developer has all the resources of the SoC (MIPS and Memory) to
optimize the module. When these modules are integrated at the system-level, the system
resources are shared among the modules. So the application integrator needs to know
the MIPS and memory requirements of the modules unambiguously to be able to allocate
the shared resources to critical needs [74]. Usually, the modules’ memory requirements
are given only at a high level. To be able to optimize the whole application/system, the
integrator will need detailed memory analysis at the module-level; e.g., which data buffers
need to be placed in dual ported memories and which data buffers should not be placed
in the same memory bank – this data is usually not available. Moreover, the critical code
is usually written in low-level assembly language to meet real-time constraints and/or
due to legacy reasons. For the above reasons, application integration and optimization (analyzing the application and mapping software modules to obtain optimal cost and performance) takes a significant amount of time, approximately 1-2 man-months. Currently, in most SoC designs, data layout is also performed manually, which has two major problems: (1) the development time is significant, which is not acceptable for current-day time-to-market requirements, and (2) the quality of the solution varies with the expertise of the designer.
1.4 Memory Architecture Exploration
In modern embedded systems, the area and power consumed by the memory subsystem is
up to 10 times that of the data path, making memory a critical component of the design
[11]. Further, the memory subsystem constitutes a large part (typically up to 70%) of
the silicon area of a current-day SoC, and this is expected to go up to 94% by 2014, as
shown in Figure 1.3 [6]. The main reason for this is that embedded memory has a relatively small per-area design cost in terms of man-power, time-to-market and power consumption [60]. Hence memory plays an important role in the
design of embedded SoCs. Further the memory architecture strongly influences the cost,
performance and power dissipation of an embedded SoC.
As discussed earlier, the on-chip memory organization of embedded processors varies
widely from one SoC to another, depending on the application and market segment for
which the SoC is deployed. There is a wide variety of choices available for the embed-
ded designers, starting from simple on-chip SPRAM based architecture to more complex
cache-SPRAM based hybrid architecture. To begin with, the system designer needs to
decide whether the SoC requires a cache and what the right size of the on-chip RAM is. Once the high
level memory organization is decided, the finer parameters need to be defined to complete
the memory architecture definition. For the on-chip SPRAM based architecture, the pa-
rameters, namely, size, latency, number of memory banks, number of read/write ports per
memory bank and connectivity, collectively define the memory organization and strongly
influence the performance, cost, and power consumption. For cache based on-chip RAM,
Figure 1.3: Memory Trends in SoC
the finer parameters are the size of cache, associativity, line size, miss latency and write
policy. Due to its large impact on system performance parameters, the memory architec-
ture is often hand-crafted by the designer based on the targeted applications. However,
with the combination of on-chip SPRAM and cache, the memory design space is too large
for a manual analysis [31]. Also, with the projected growth in the complexity of embed-
ded systems and the vast design space in memory architecture, hand optimization of the
memory architecture will soon become impossible. This warrants an automated frame-
work which can explore the memory architecture design space and identify interesting
design points that are optimal from a performance, power consumption and VLSI area
(and hence cost) perspective. As the memory architecture design space itself is vast, a
brute force design space exploration tool may take a prohibitively long computation time and hence is
unlikely to be useful in meeting the tight time-to-market constraint. Further, for each
given memory architecture, there are several possible data section layouts which are opti-
mal in terms of performance and power. This further compounds the memory architecture
exploration problem.
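To get a feel for the scale, a back-of-the-envelope enumeration over plausible parameter ranges is shown below; the ranges are illustrative assumptions of ours, not the ones used later in the thesis:

```python
from itertools import product

# Illustrative parameter ranges for a hybrid cache + SPRAM architecture.
cache_sizes   = [0, 2048, 4096, 8192, 16384, 32768]   # bytes; 0 = no cache
associativity = [1, 2, 4]
line_sizes    = [16, 32, 64]                           # bytes
spram_sizes   = [8192, 16384, 32768, 65536]            # bytes
num_banks     = [1, 2, 4, 8]
bank_ports    = [1, 2]                                 # single vs dual ported

design_points = list(product(cache_sizes, associativity, line_sizes,
                             spram_sizes, num_banks, bank_ports))
print(len(design_points))  # 6*3*3*4*4*2 = 1728 even for these coarse ranges
```

With finer size steps and independent per-bank size and port choices, the product quickly reaches the order of 100,000 design points cited earlier, and each point still has to be evaluated under many candidate data layouts.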
1.5 Embedded System Design Flow
In this section, we present our view of embedded system design flow to set the context
for our work. For this purpose, we introduce the notion of the X-chart, which is inspired by the well-known Y-chart introduced by Gajski to capture the process of VLSI system
design [29].
In a Y-chart, the three levels of design abstraction form the three dimensions of the
figure Y; these are (a) design behavior, (b) design structure and (c) physical aspects of
the design. A design flow starts from a behavior specification, which is then mapped to
a structure, which in turn is mapped to a physical realization. We can view the process
of transforming a behavior to a physical realization as a successive refinement process.
Optimization of design metrics such as area, performance, and power are the goals of
each of these refinement steps. The design process may spiral from the behavioral axis to
structural axis to physical design axis in multiple stepwise refinement steps.
We introduce the notion of the X-chart, which is illustrated in Figure 1.4. The X-
chart representation has four axes: (a) Behavior, (b) Logical Architecture, (c) Physical
Architecture and (d) Software Data Layout. The logical memory architecture (LMA)
defines the embedded cache size, cache associativity, cache block size, size of the scratch
pad memory, number of memory banks, and the number of ports. The physical memory
architecture (PMA) is an actual realization of an LMA using the memory library com-
ponents provided by the semiconductor vendor. The fourth dimension, namely Software
Data Layout, is necessary for capturing the process of embedded system design. We have
identified several steps in the embedded system design flow and marked them with circled
numbers. Table 1.1 explains the individual steps in the X-chart representation.
The design of an embedded system begins with a behavioral description (Point (1)
in Figure 1.4, which is shown on the behavioral axis). Today, there are many languages
available to capture the system behavior, e.g., System Verilog [5], System C [4], and so
on. Hardware-software partitioning is performed to identify which functionalities of the
description are best performed in hardware and which are best implemented in software.
Hardware implementation is cost-intensive, but improves the performance.
We show point (2) on the LMA axis, since hardware-software partitioning adds a considerable amount of detail needed to decide the LMA parameters. The next step is to select
hardware and software IP blocks. Depending on the time schedule (for designing the
embedded system) and the cost constraint, the designer may wish to use readily available
IP blocks from a vendor or implement a custom version of the IP. The target platform
is then defined to implement the embedded system. As mentioned earlier, a platform in-
cludes one or more processors, memory, and hardware accelerators for specific functions.
Platforms also come with software tools such as compilers and simulators, so that the
development cycle can be accelerated. In other words, one does not need to wait for the
hardware implementation to complete before trying out the software. We show point (4)
on the software data layout axis, since the selection of a platform defines many aspects
of software implementation. Software partitioning is now performed to decide which soft-
ware IP blocks are executed on which processor. This completes one spiral cycle in the
design life cycle of the embedded system. To recapitulate, the following components are
defined at the end of the first cycle: (a) the platform on which the embedded system
will be built, (b) the hardware and software IP blocks that are selected for the target
application, (c) assignment of software IP blocks to target processors where the software
will be executed. We show point (5) on the behavioral axis, since the next spiral cycle
will begin from here.
The next step is to define the logical memory architecture for the memory subsystem.
Guided by considerations such as cost, performance, and power, the designer must decide
basic architectural parameters of the memory sub-system, such as whether or not to
provide cache memory, how many memory banks are provided, whether or not dual-
ported memories are necessary for guaranteeing performance, etc. The next step is to
perform design space exploration in the logical space. Each logical memory architecture
is also characterized by the selection of values for parameters such as cache size, cache
associativity, cache block size, etc. There is often a cost/performance tradeoff between two
solutions in the architectural space. Hence the designer must consider different Pareto-
optimal solutions that exhibit cost/performance tradeoff. This results in point (6) in
Figure 1.4.
Figure 1.4: Application Specific SoC Design Flow Illustration with X-chart
A logical memory architecture must be translated into a physical implementation by
selecting components from the semiconductor vendor’s memory library. There are multiple
realizations, i.e., physical memory architectures (PMA) for the same LMA. This involves
choosing the appropriate modules based on the process technology selected in step (7),
and the corresponding semiconductor vendor memory library. These represent tradeoffs in terms of power consumed and VLSI area. This leads to point (7) in Figure 1.4. The
mapping of an LMA to a PMA is similar to the technology mapping step in logic synthesis
[53]. Data Layout (DL) is the subsequent step in the design life cycle. During this step,
the placement of data variables is determined, considering every possible implementation
Table 1.1: Explanation of X-chart Steps
of the physical memory architecture. Once again, there are multiple solutions for data
layout for a given PMA. These solutions may exhibit tradeoffs in power, performance,
and area.
In this thesis, we use the phrase Physical Memory Architecture Exploration (PMAE)
to refer to the search for Pareto-optimal LMA/PMA/DL solutions. We capture this in
the form of an equation that follows.
PMAE = Logical Memory Architecture Exploration + Memory Allocation Exploration + Data Layout Exploration   (1.1)
In this thesis, the focus is on memory sub-system optimization, constituted by steps
(5) to (9) in Figure 1.4. The size of the solution space increases manifold during each
step of the memory exploration. If N1 optimal solutions (logical memory architectures)
are identified during memory sub-system definition, memory allocation must be explored
for each one of them, which can potentially result in N1 × N2 solutions during memory
allocation exploration. Similarly, data layout must be performed for each of the N1 × N2 solutions from the memory allocation exploration step, and we may in general obtain N1 × N2 × N3 Pareto-optimal points in the PMAE solution space. As mentioned earlier, this problem can result in exploring a combinatorially exploding design space.
1.6 Contributions
First, we propose methods for data layout optimization, assuming a fixed memory archi-
tecture for a DSP-based embedded system architecture. Data layout is a critical compo-
nent in the embedded design cycle and decides the final configuration of the embedded
system. Data layout happens at the final stage in the life cycle of an embedded system, as
illustrated in the X-chart of Figure 1.4. Data layout forms the foundation for memory sub-
system optimization. Hence, we first formulate data section layout as an Integer Linear
Programming (ILP) problem. The proposed ILP formulation can handle: (i) partitioning
of data between on-chip and off-chip memory, (ii) placing simultaneously accessed data variables (parallel conflicts) in different on-chip memory banks, (iii) placing data variables that are accessed multiple times in the same cycle (self conflicts) in dual-access RAMs, (iv) overlay of data sections with non-overlapping lifetimes, and (v) swapping of data sections from/to off-chip memory.
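The cost that such a formulation minimizes can be illustrated with a small conflict model (the bank names, conflict pairs and one-stall penalty below are our own simplifications): a parallel conflict is penalized when two variables accessed in the same cycle sit in the same single-access bank, and a self conflict when a variable accessed twice per cycle sits in single-access memory. This sketch only evaluates one given layout, whereas the ILP searches over all layouts:

```python
# Toy conflict-cost model for a given data layout (illustrative only).
# bank_of maps each variable to a bank; DARAM banks allow 2 accesses/cycle.
daram_banks = {"B0"}                          # assumed: bank B0 is dual-access
bank_of = {"x": "B0", "y": "B1", "z": "B1"}   # a candidate layout

# (a, b) means a and b are accessed in the same cycle; (a, a) models a
# self conflict, i.e., two accesses to the same variable in one cycle.
conflicts = [("y", "z"), ("x", "x"), ("y", "y")]

def layout_cost(bank_of, conflicts):
    stalls = 0
    for a, b in conflicts:
        same_bank = bank_of[a] == bank_of[b]
        if same_bank and bank_of[a] not in daram_banks:
            stalls += 1   # the second access must wait one cycle
    return stalls

print(layout_cost(bank_of, conflicts))  # 2: both B1 conflicts cost a stall
```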
An important contribution of this work is the development of a simple unified ILP
formulation to handle all the above mentioned optimizations. The ILP based approach
is very effective for many moderately complex applications and delivers optimal results.
However, as the application complexity increases, the execution time of the ILP method increases drastically, making it unsuitable for large applications and for situations (such as memory architecture exploration) where the data layout problem needs to be solved repeatedly.
Hence we looked at developing faster methods to solve this problem. We propose a
heuristic algorithm that maps the data sections to the given memory architecture and
reduces the number of memory access conflicts resulting from both self conflicts and
parallel conflicts. Finally, we also formulate the same problem in Genetic Algorithm (GA)
and compare the results of the heuristic with GA. We find that the heuristic algorithm
performs within 5% of GA’s results with GA performing better. However, the heuristic
algorithm’s run-time is an order faster than GA’s run-time making it suitable to be used
for memory architecture exploration.
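The heuristic itself is presented in Chapter 3; as a rough illustration of conflict-aware placement, the sketch below greedily assigns data sections to banks so that heavily conflicting pairs land in different banks. The conflict counts and section names are hypothetical, and this is a simplification of the actual algorithm, not a reproduction of it.

```python
# Parallel-conflict counts between section pairs: conflicts[(a, b)] is how
# often a and b are accessed in the same cycle (hypothetical profile data).
conflicts = {("x", "y"): 500, ("x", "z"): 200, ("y", "z"): 50}
sections = ["x", "y", "z", "w"]
NUM_BANKS = 2

def conflict(a, b):
    return conflicts.get((a, b)) or conflicts.get((b, a)) or 0

def greedy_bank_assign(sections, num_banks):
    """Greedy heuristic: place sections in decreasing order of total
    conflict weight, each into the bank where it conflicts least with
    the sections already placed there."""
    order = sorted(sections,
                   key=lambda s: -sum(conflict(s, t) for t in sections))
    banks = [[] for _ in range(num_banks)]
    for s in order:
        cost = [sum(conflict(s, t) for t in bank) for bank in banks]
        banks[cost.index(min(cost))].append(s)
    return banks

banks = greedy_bank_assign(sections, NUM_BANKS)
print(banks)
```

Here the heaviest conflict (x with y) is resolved by separation, and only the cheapest conflict (y with z) remains within a bank.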
Next, we address logical memory architecture exploration for DSP-based embedded
systems (step (5) to (7) in the X-chart of Figure 1.4). The input is a set of high-level
memory parameters such as the number of memory banks, size of each memory bank,
number of ports etc., that define the memory sub-system. The goal of the exploration is
to find an optimal on-chip memory organization that can run the given applications with
minimum number of memory stalls. When a logical memory architecture (LMA) is generated, it must be evaluated for cost (in terms of VLSI area) and performance, but both of these depend on the data layout. Hence, to evaluate a memory architecture properly, we must first generate an efficient data layout; for this we use our fast heuristic method. We have implemented
the memory architecture exploration problem as a two-level hierarchical search, with
architectural exploration at the outer level and data-layout exploration at the inner level.
A multi-objective GA and a Simulated Annealing algorithm (SA) are used as alternate
search mechanisms for the architectural exploration problem. As the memory architecture
exploration framework considers both performance and cost (VLSI area) objectives, we use the Pareto-optimality criterion proposed in [25] to identify design points that are interesting with respect to one objective or the other.
The proposed memory exploration framework is fully automatic and flexible. The
framework is also scalable, and additional objectives like power consumption can be added
easily. We have used four different applications from multimedia and communication
domains for our experiments and found 100-200 Pareto-optimal design choices (memory
architectures) for each of the applications.
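The two-level structure can be sketched as follows: an outer loop proposes architectural parameters, and an inner evaluation (standing in for the data layout heuristic) scores each candidate. The parameter ranges and cost models below are toy assumptions, and random sampling stands in for the multi-objective GA/SA used in this work.

```python
import random

random.seed(1)

# Candidate logical memory parameters (assumed ranges, not the thesis's).
BANK_COUNTS = [1, 2, 4]
BANK_SIZES_KB = [2, 4, 8]

def area_cost(num_banks, bank_kb):
    # Toy area model: more and larger banks cost more silicon.
    return num_banks * bank_kb * 1.0 + num_banks * 0.5

def layout_stalls(num_banks, bank_kb):
    # Stand-in for the inner data-layout heuristic: more banks and
    # capacity mean fewer conflict/capacity stalls in this toy model.
    return 1000.0 / (num_banks * bank_kb)

def explore(samples=20):
    """Outer loop samples architectures; the inner call evaluates each
    via a data layout, yielding (area, stalls) design points."""
    points = set()
    for _ in range(samples):
        nb = random.choice(BANK_COUNTS)
        kb = random.choice(BANK_SIZES_KB)
        points.add((area_cost(nb, kb), layout_stalls(nb, kb)))
    return sorted(points)

print(explore()[:3])
```

Each returned pair is one design point; filtering this set for Pareto optimality (Section 2.5) yields the interesting architectures.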
Next, we explore the data layout design space for a given physical memory architecture
in order to optimize the performance and power consumption of the memory subsystem.
Note that data layout exploration forms steps (8) to (9) in the X-chart representation.
We propose MODLEX, a Multi Objective Data Layout EXploration framework based
on Genetic Algorithm that explores the data layout design space for a given logical and
physical memory architecture and obtains a list of Pareto-optimal data layout solutions
from performance and power perspectives. Most of the existing work in the literature
assumes that performance and power are non-conflicting objectives with respect to data
layout. However, we show that a significant trade-off (up to 70%) is possible between power and performance.
Our next step is physical memory architecture exploration (steps (5) to (8) in Figure 1.4).
We propose two different methods for physical memory exploration. The first approach is
an extension of the Logical Memory Architectural Exploration (LMAE) method described
in Chapter 4 and represented in the X-chart by steps (5) to (6). Physical memory exploration is performed by taking the output of LMAE and, for each of the Pareto-optimal logical memory architectures, performing a memory allocation exploration (steps (6) to (7)) with the objective of optimizing power and area in the physical memory space. Note that the data
layout is fixed at the logical memory exploration stage itself and hence the performance
does not change at this step. The memory allocation exploration is formulated as a multi-
objective Genetic search to explore the design space with power and area as objectives.
We refer to this approach as LME2PME.
The second approach is a direct and integrated approach for Physical Memory Ex-
ploration, which we refer to as DirPME. This approach corresponds to a direct move
from point 5 to point 8 in Figure 1.4. In this approach, we integrate three critical com-
ponents together: (i) Logical Memory Architecture Exploration, (ii) Memory Allocation
Exploration (iii) Data layout exploration. The core engine of the memory architecture
exploration framework is formulated as a Multi-objective Non-Dominated Sorting Genetic
Algorithm (NSGA) [25]. For the data layout problem, which needs to be solved for thousands of memory architectures, we use our fast and efficient heuristic data layout method.
Our integrated memory architecture exploration framework searches the design space by
exploring thousands of memory architectures and lists 200-300 Pareto-optimal design solutions that are interesting from an area, power, and performance viewpoint.
Next, we address the memory architecture exploration problem for hybrid memory ar-
chitectures that have a combination of SPRAM and cache. For such a hybrid architecture,
a critical step is to partition the data between on-chip SPRAM and external RAM. Data
partitioning aims at improving the overall memory sub-system performance by placing
data in SPRAM that has the following characteristics: (a) high access frequency, (b) a life time that overlaps with those of many other data, and (c) poor spatial access characteristics. Placing all data that exhibit these characteristics in SPRAM reduces the number of potentially conflicting data in the cache, which in turn reduces cache misses and improves overall memory sub-system performance.
But typically the SPRAM size is small and it is not possible to accommodate all the
data identified for SPRAM placement. Hence, even after data partitioning, there will be
a significant number of potentially conflicting data sections that need to be placed in external RAM. These data need to be placed such that the conflict misses caused between them are reduced. Cache-conscious data layout addresses this problem and
aims at placing data in external RAM (off-chip RAM) with the objective to reduce cache
misses. This is achieved by an efficient data layout heuristic that is independent of in-
struction caches, optimizes run-time and keeps the off-chip memory address space usage
under check. We extend the above approach and perform hybrid memory architecture exploration with the objective of optimizing run-time performance, power consumption and area.
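A minimal sketch of the partitioning idea follows, assuming a made-up scoring function that combines the three characteristics listed above (access frequency, lifetime overlap, poor spatial locality). All names and numbers are hypothetical; the actual heuristic is developed in Chapter 7.

```python
# Each entry: (name, size, access_freq, lifetime_overlaps, spatial_locality)
# spatial_locality in [0, 1]: low values mean poor locality (hypothetical).
data = [("fir_state", 2048, 80000, 5, 0.2),
        ("frame_buf", 8192, 30000, 2, 0.9),
        ("twiddle", 1024, 60000, 6, 0.1),
        ("log_buf", 4096, 1000, 0, 0.8)]

SPRAM_SIZE = 4096

def spram_candidates(data, capacity):
    """Rank data by a score favoring high access frequency, many lifetime
    overlaps, and poor spatial locality, then fill SPRAM greedily."""
    def score(d):
        _, size, freq, overlaps, locality = d
        return (freq * (1 + overlaps) * (1 - locality)) / size
    chosen, used = [], 0
    for d in sorted(data, key=score, reverse=True):
        if used + d[1] <= capacity:
            chosen.append(d[0])
            used += d[1]
    return chosen

print(spram_candidates(data, SPRAM_SIZE))
```

The frequently accessed, poor-locality, heavily overlapping data win the SPRAM; the large, cache-friendly buffer is left for the cached external RAM.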
The salient features of our work are as follows.
• First, we provide a unified framework for logical memory exploration, memory allocation exploration, and data layout.
• Our work addresses power, performance, and area optimization in an integrated framework.
• Our work provides a memory architecture exploration framework for a hybrid memory architecture involving on-chip SPRAM and cache.
• Our work does not rely on source-code optimization for power and performance improvements; hence it is suitable for platform-based/IP-based system design.
1.7 Thesis Overview
The rest of the thesis is organized as follows. In the following chapter, we provide the
background material for the thesis. We begin by explaining the memory architecture of
a DSP and an MCU. We summarize the software optimizations used in the literature
to improve memory access efficiency. We explain cache-based embedded SoCs and their challenges with respect to predictability. Finally, we introduce the concepts of Genetic Algorithms (GA) for optimization, since a GA is used in our optimization framework in later chapters.
In Chapter 3, we propose different methods to address the data layout problem for on-chip SPRAM based memory architectures. First, we propose an Integer Linear Programming (ILP) based approach. Further, we also propose a fast and efficient heuristic for the data
layout problem. Finally, we formulate the data layout problem in Genetic Algorithm
(GA).
In Chapter 4, we present a multi-objective memory architecture exploration framework
to search the memory design space for the on-chip memory architecture with performance
and memory cost as two objectives. We address the memory architecture exploration
problem at the logical level.
The multi-objective Data Layout Exploration problem is addressed in Chapter 5. Here,
the data layout design space is explored for a given logical memory architecture and
application with respect to performance and power.
In Chapter 6, we address the memory architecture exploration problem at physical
memory level. In this chapter we propose two different approaches for physical memory architecture exploration.
An SPRAM-cache based hybrid architecture is considered in Chapter 7. In this chapter, we propose an efficient heuristic to partition data between on-chip SPRAM and cache. Further, we propose a cache-conscious data layout. The memory design space is explored by
using an exhaustive search based approach.
Finally in Chapter 8, we summarize our work and outline the future work.
In summary, Figure 1.5 maps each chapter of this thesis to the steps of the X-chart. As can be seen in the figure, this work addresses memory subsystem exploration and optimization at the architectural level, taking both hardware design and software (application development) constraints into consideration.
Figure 1.5: Mapping Chapters to X-chart Steps
Chapter 2
Background
In this chapter we provide the background information that is useful for understanding the rest of the thesis. The following section explains the on-chip memory architecture of Digital Signal Processors (DSPs) and Microcontrollers (MCUs). Section 2.2 presents the software optimizations used in embedded applications that are targeted at using on-chip memory efficiently. Section 2.3 describes cache based on-chip memory architectures and motivates the need for cache-SPRAM based hybrid architectures for embedded SoCs. In Section 2.4, an overview of Genetic Algorithms is presented. Finally, in Section 2.5, the importance of multi-objective multiple design solutions for platform based design is explained.
2.1 On-chip Memory Architecture of Embedded Processors
2.1.1 DSP On-chip SPRAM Architecture
DSP processor based embedded systems have an on-chip memory which typically has a
single cycle access time [49]. The on-chip memory, also referred to as scratch pad memory,
is mapped into an address space disjoint from the off-chip memory but connected to
the same address and data buses.¹ Typically, the scratch-pad memory is organized into multiple memory banks to facilitate multiple simultaneous data accesses. DSP processors typically have two or more address generation units and multiple on-chip buses to facilitate
multiple memory accesses.
Figure 2.1: Example DSP Memory Map
Further, each on-chip memory bank can be organized either as a single-access RAM
(SARAM) or as a dual-access RAM (DARAM), to provide single or dual accesses to
the same memory bank in a single cycle. For example, the Texas Instruments TMS320C54X digital signal processor has two data read buses and one data write bus [75], and the Texas Instruments TMS320C55X processor has three data read buses and two data write buses, since concurrent accesses to the same array are common in DSP applications [76]. Figure 2.1 presents the memory map of C55X DSPs, where multiple SARAM and DARAM memory banks form part of the memory map, and MMR denotes the memory-mapped registers, which typically contain control registers, status registers and stack pointers. The DARAM and SARAM regions are organized as multiple memory banks to enable two concurrent accesses.
¹We use the terms "scratch pad memory", "on-chip memory" and "internal memory" interchangeably. Similarly, "off-chip memory" and "external memory" are used interchangeably.
2.1.2 Microcontroller Memory Architecture
Microcontrollers (MCUs) are designed to execute control-type applications efficiently. The applications that run on microcontrollers are not very data intensive and hence do not require DARAM. But the real-time constraints of embedded applications and the need to run the applications in a time-bound manner require on-chip SPRAM. Similar to DSP memory architectures, MCU processors also have on-chip SPRAM. But unlike a DSP's on-chip SPRAM, the MCU's on-chip RAM is not organized as multiple memory banks, because MCU applications typically do not perform more than one memory access per clock cycle. The MCU's on-chip RAM may nevertheless be constructed with multiple physical memory modules due to practical constraints: (a) smaller memory modules are faster and more power efficient and (b) it is not practical to construct one large memory module and still meet the access latency constraint. For example, to
construct 192KB of on-chip SPRAM, hardware designers typically use 6×32KB memory modules. However, the 6×32KB organization is normally not exposed to the software application developers, and hence from a software development perspective it is still one monolithic 192KB memory. This distinction between what is exposed to the application programmer, referred to as the logical memory architecture, and how the same is realized using physical memory modules (banks), referred to as the physical memory architecture, is important for both DSPs and MCUs.
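The logical-to-physical relationship in the 192KB example can be made concrete. Assuming the flat logical address space is simply striped across the six modules in address order (one plausible realization), a logical byte address maps to a module and offset as follows:

```python
MODULE_KB = 32
NUM_MODULES = 6  # 6 x 32KB modules realize a 192KB logical SPRAM

def physical_location(addr):
    """Map a logical SPRAM byte address to (module index, offset) in the
    physical realization.  The application sees one flat 192KB space."""
    module_bytes = MODULE_KB * 1024
    assert 0 <= addr < NUM_MODULES * module_bytes
    return addr // module_bytes, addr % module_bytes

print(physical_location(100 * 1024))  # a logical address 100KB into the flat space
```

The software never sees the `(module, offset)` pair; only the hardware designer (and the physical memory exploration of Chapter 6) cares about it.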
Further, the MCU's on-chip SPRAM can be realized using non-uniform sized memory banks to optimize overall system power consumption. For example, it is area efficient to use large memory modules to construct the on-chip SPRAM, but large memories consume more power per read/write access compared to smaller memories: a 2KB memory module typically consumes only half the power per read/write access of an 8KB memory module, while 4×2KB modules consume more area than one 8KB module. Hence there is an area-power trade-off in selecting memory modules to construct on-chip SPRAM. Non-uniform bank sized memory architectures aim at balancing the area-power objectives.
In this thesis our focus is on memory architecture optimization for DSPs. This is because the memory architecture of a DSP is more complex than that of a microcontroller (MCU), for the following reasons: (a) DSP applications are more data dominated than the control-dominated software executed on an MCU. Memory bandwidth
requirements for DSP applications range from 2 to 3 memory accesses per processor clock
cycle. For an MCU, this figure is, at best, one memory access per cycle. (b) It is critical in DSP applications to extract maximum performance from the memory subsystem in
order to meet the real-time constraints of the embedded application. As a consequence,
the DSP software for critical kernels is developed mostly as hand optimized assembly
code. In contrast, the software for MCU is typically developed in high-level languages.
The memory architecture for a DSP is unique since the DSP has multiple on-chip buses
and multiple address generation units to service higher bandwidth needs. The on-chip
memory of embedded processors can include (a) only Level-1 cache (L1-cache) (e.g., [1]),
(b) only scratch-pad RAM (SPRAM) (e.g., [75, 76]), or (c) a combination of L1-cache and
SPRAM (e.g., [2, 77]).
2.2 Software Optimizations
Embedded applications have a built-in hierarchy. An application is composed of several
modules, where each module consists of one or more code and data sections [74]. Each
data section consists of a set of data variables, and/or data arrays, grouped purely for the
sake of convenience. But it is also typical that each data array will be named as a separate
section. Software developers spend considerable effort and time to achieve a careful layout
of code and data sections to get maximum performance from the scratch pad memory [74].
The applicability of these optimizations and the methods used to perform data layout depend on the memory architecture. There are different software optimizations necessary for on-chip
memory architectures of DSP and MCU processors, which are discussed in the following
subsections.
2.2.1 DSP Software Optimizations
To take advantage of the multiple on-chip memory buses provided by the underlying pro-
cessor architecture, software application developers must carefully partition the data into
several independent sections. A data section typically holds an array or a set of program
data structures and is placed contiguously in a memory bank. The data structures that
are used in the same instruction cycle are said to be mutually conflicting and are ideally
assigned to different sections so that they can be placed in different memory banks. As-
signing data structures to separate sections increases the number of placement decisions
drastically.
Several software optimization techniques for improving the performance have been
proposed in the literature [10, 14, 37, 40, 44, 53, 58, 74, 79], including:
• Placing frequently accessed data variables to on-chip SPRAM and placing less fre-
quently accessed data variables in off-chip RAM [10, 53].
• Partitioning data arrays that are accessed simultaneously in the same processor cycle
into different on-chip memory banks. This way multiple data can be simultaneously
accessed in the same cycle without incurring any additional memory stalls [40, 58].
• Mapping a data array that must support multiple simultaneous accesses to DARAM. This avoids additional memory stalls for two simultaneous accesses [44].
• Overlay of data structures, typically arrays, to share the same on-chip memory
space. These arrays are referred to as scratch buffers [74]. The life time of these buffers is limited to a software module. Hence scratch buffers corresponding to
different modules, which are not live simultaneously, can share the same on-chip
memory space.
• Swapping critical code and data sections from off-chip memory to on-chip memory
before the execution of the appropriate code segment. This facilitates efficient access to the code/data currently being accessed. The benefits of swapping (on-chip access and reduced memory stalls) should more than compensate for the cost of swapping [37, 79].
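The overlay optimization above can be sketched as an interval problem: buffers whose lifetimes never overlap can share one on-chip region, so the region only needs to hold the largest set of simultaneously live buffers. The buffer names, sizes, and lifetime intervals below are hypothetical.

```python
# Scratch buffers with (size, live_interval) -- hypothetical lifetimes
# expressed as (start, end) positions in a module execution order.
buffers = {"fft_tmp": (2048, (0, 3)), "viterbi_tmp": (1024, (4, 7)),
           "eq_tmp": (2048, (2, 5))}

def overlay_size(buffers):
    """Return the peak concurrently-live size: the on-chip space needed
    when all these buffers share one overlaid region."""
    events = []
    for size, (start, end) in buffers.values():
        events.append((start, size))        # buffer becomes live
        events.append((end + 1, -size))     # buffer dies after `end`
    peak = live = 0
    for _, delta in sorted(events):         # deaths sort before births at a tick
        live += delta
        peak = max(peak, live)
    return peak

print(overlay_size(buffers), sum(s for s, _ in buffers.values()))
```

Here the overlaid region needs 4096 bytes instead of the 5120 bytes a dedicated allocation would use, because `fft_tmp` and `viterbi_tmp` are never live together.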
Except for the swapping technique, which works on both code and data, all the other
techniques concentrate only on data. Managing data is very important because most of
the embedded applications are data dominated [19].
Towards achieving this goal, critical code and data — which are accessed frequently —
are identified by performing extensive simulation and profiling of the application. The decision to place a data structure in the on-chip SRAM is taken after analyzing the access frequency of the variable in the application.
The ideal case is where all the critical code and data sections can be placed in the on-
chip memory. While this can result in very high performance, in terms of fewer memory
stall cycles, it is also prohibitively expensive to support such a large on-chip SRAM.
Hence to achieve a good performance/cost ratio, a careful data layout for the memory
architecture is mandatory.
Taking the above optimizations into consideration, a code and data section layout can
be defined as a mapping which specifies where (i.e., in which memory type) the various
code and data sections reside, the memory bank(s) on which the sections reside, the type
of memory access (single or dual access) supported in the memory bank, whether or not
certain code (or data) sections are overlayed, and whether or not certain code (or data)
sections are swapped.
2.2.2 MCU Software Optimizations
Typically, embedded applications running on the MCU are control-oriented and not very
computation intensive. The primary objective is to use the on-chip SPRAM efficiently.
Towards this, the application is profiled to get the access frequency of all the data variables.
Frequently accessed variables are placed in on-chip SPRAM and less frequently accessed
variables are placed in off-chip RAM.
With an objective to optimize power consumption, a non-uniform bank size based on-chip SPRAM architecture is used in [14]. The key idea is that smaller banks are used to accommodate the most frequently accessed variables; this placement optimizes the system power. For example, let a and b be two data variables, each 1KB in size, accessed 100000 times and 20000 times respectively. For an on-chip SPRAM of 16KB organized as 4×2KB and 1×8KB, placing a in one of the 2KB banks and placing b in the 8KB bank is more power optimal than the other way around.
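The arithmetic behind this example can be checked directly, assuming (as stated above) that the 2KB bank costs about half the energy per access of the 8KB bank; the 1-unit and 2-unit figures below are that assumption made concrete, not measured values.

```python
E_2KB, E_8KB = 1.0, 2.0   # assumed energy units per access (2KB ~ half of 8KB)
ACCESSES_A, ACCESSES_B = 100_000, 20_000

# Placement 1: frequently accessed `a` in a 2KB bank, `b` in the 8KB bank.
energy_good = ACCESSES_A * E_2KB + ACCESSES_B * E_8KB
# Placement 2: the other way around.
energy_bad = ACCESSES_A * E_8KB + ACCESSES_B * E_2KB

print(energy_good, energy_bad)
```

Placement 1 spends 140000 units against 220000 for placement 2, a 36% energy saving from bank assignment alone.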
2.3 Cache Based Embedded SOC
All programs exhibit the property of locality of reference [68] and the cache memories
exploit this property of the programs to give improved performance. Programs exhibit
two types of locality: temporal and spatial. Temporal locality indicates that a recently accessed memory location is likely to be accessed again, while spatial locality implies that a recently accessed memory location's neighboring locations are likely to be accessed.
In cache based architectures, data is placed in an off-chip RAM and copied at run-time
to cache by a hardware cache controller. Cache controllers increase the silicon area, but
eliminate the requirement of data placement and management and the associated run-
time overhead. The mapping of data from off-chip RAM to L1-cache is dictated by the
cache associativity scheme and can create potential side effects like thrashing. Therefore,
a careful analysis of data access characteristics and understanding of temporal access
pattern of the data structures is required to improve the cache performance.
From a power, performance and area perspective, direct mapped caches are preferred
over set-associative caches. However, direct mapped caches incur much more off-chip
memory traffic [36], which, when not handled properly, can lead to very high power
consumption and lower performance. In [36], the traffic inefficiency of direct mapped
caches is evaluated for different embedded and multimedia applications from Mediabench
[43]. The traffic (data movements from off-chip RAM to cache and vice versa) inefficiency
is a factor of 10 or more even for large cache sizes, and is mainly attributed to conflict
misses. However, for application specific systems, the code is known a priori and an
optimal cache-conscious data layout will be able to reduce the number of conflict misses
and improve the performance and power consumption by reducing the off-chip memory
traffic.
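The effect of a cache-conscious layout on conflict misses can be demonstrated with a small direct-mapped cache simulation. The cache geometry, array sizes, and access pattern below are assumptions chosen to make the conflict visible, not figures from [36]: two arrays whose base addresses differ by exactly the cache size thrash every set, while offsetting one array by a single line removes the conflicts.

```python
LINE = 32          # bytes per cache line (assumed)
SETS = 64          # direct-mapped: one line per set -> 2KB cache

def misses(trace, line=LINE, sets=SETS):
    """Count the misses of a direct-mapped cache on a byte-address trace."""
    tags = [None] * sets
    count = 0
    for addr in trace:
        block = addr // line
        idx, tag = block % sets, block // sets
        if tags[idx] != tag:   # miss: fetch the block, evict the old tag
            tags[idx] = tag
            count += 1
    return count

cache_bytes = LINE * SETS          # 2048
# Bases exactly one cache size apart: every interleaved pair of word
# accesses maps to the same set (worst-case layout)...
bad = [a for i in range(256) for a in (i * 4, cache_bytes + i * 4)]
# ...while padding the second array by one line breaks the alignment.
good = [a for i in range(256) for a in (i * 4, cache_bytes + LINE + i * 4)]
print(misses(bad), misses(good))
```

In this toy trace the aligned layout misses on all 512 accesses, while the padded layout misses only on the 64 cold/compulsory fills: the same code, an 8× traffic difference, purely from data placement.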
2.3.1 Cache-SPRAM Based Hybrid On-chip Memory Architecture
Designers of real-time embedded memories have typically preferred scratch-pad memories
(SPRAM) over data caches since the latter lead to unpredictability in access latencies as
cache hits and misses result in different access times. For small to medium applications, SPRAM gives acceptable performance when the memory is used efficiently through optimal data layout. However, medium to large applications require a large SPRAM to meet the real-time performance criterion, because SPRAM is dedicated to specific data variables and is not reused or shared among different data variables. There are dynamic data layout approaches that
or shared among different data-variables. There are dynamic data layout approaches that
aim at sharing the SPRAM by bringing the data-variables from off-chip RAM to on-chip
RAM at run-time [37, 79]. However, these approaches are very complex and make the software very difficult to maintain and debug. Data movements in the dynamic layout approach may also lead to code size and run-time overheads. Hence architectures with only SPRAM as on-chip memory will be highly area inefficient for large applications. On the other hand, cache based architectures use the on-chip memory efficiently by sharing the cache sets among data variables at run-time. However, cache based architectures result in unpredictable execution times. It is difficult to estimate the worst-case guaranteed
run-time performance in a cache based system, which is a requirement for all embedded
systems. Several approaches have been proposed to predict the worst case execution time
in such systems [78, 81]. Hybrid memory architectures have become popular for real-time
embedded systems, since caches offer sharing of on-chip memory space and SPRAM offers
predictability. Hence, many embedded SoCs use a mix of SPRAM and cache memories, as
shown in Figure 2.2.
2.4 Genetic Algorithms - An Overview
Genetic Algorithms (GA) [30] belong to the class of stochastic search methods [69]. Other
stochastic search methods include simulated annealing [57], threshold acceptance [27], and
Figure 2.2: Cache-SPRAM Based On-Chip Memory Architecture
some forms of branch and bound [24]. Most stochastic search methods operate on a single solution to the problem at hand, whereas genetic algorithms operate on a set of solutions, which leads to faster convergence.
To use a genetic algorithm, the problem at hand needs to be encoded as an object. In GA terminology, the encoded object is called a chromosome. A population consists of a set of such chromosomes. GA combines "randomness" and "survival of the fittest" to perform an effective search in the solution space [30].
Figure 2.3 explains the basic flow of a GA. To start with, the solution to the problem at hand needs to be modeled as a chromosome; a better chromosome means a better solution. The next step is to create a set of P chromosomes, referred to as the population, initialized by the initialization step in Figure 2.4. The objective of the GA is to keep operating on the chromosomes in the current population to generate new chromosomes and select the P fittest chromosomes. The GA uses operators like selection, crossover and mutation for this purpose. Observe that there are two nested loops in the GA, as shown in Figure 2.4. The outer loop corresponds to the evolution of different generations and the inner
loop constitutes the GA operations that generate a set of new chromosomes within a generation.
The inner loop starts with the selection operation, which picks two of the best individuals for mating. Some of the more commonly used selection methods [30] are (i) roulette wheel selection, (ii) tournament selection and (iii) rank selection. The probability of selecting a chromosome in roulette wheel selection is proportional to the fitness of the chromosome. For tournament selection, a set of chromosomes is selected based on roulette wheel selection and then the top two chromosomes are picked among the selected chromosomes. Rank selection always picks the best two chromosomes based on the fitness function. We have used roulette wheel selection in our work.
Figure 2.3: Genetic Algorithm Flow
A crossover operation is performed on the selected pair of chromosomes, called the parents. The crossover operator typically takes part of each parent and generates two children. Thus, the child chromosomes are expected to have a combination of characteristics from both parents. Since the parents are among the best chromosomes in the current population, the children are expected to be better than the parents, by evolutionary reasoning. Typically a 3-point crossover is performed, as illustrated in Figure 2.3. After the crossover operation, the mutation operation is performed with a certain probability. The mutation operation randomly changes certain elements (flips a bit) of the chromosome, introducing a certain amount of randomness into the search; it can help the search find solutions that crossover alone might not encounter. For each of the new chromosomes, objective functions are computed.
There can be more than one objective. A fitness value is assigned based on the set of objectives; the fitness function represents how good a chromosome is (in other words, how good a solution is). This set of operations is repeated (the inner loop) until M new chromosomes are generated, and for each of the new M chromosomes the objective functions and fitness values are computed.
The last step in a generation (outer loop) is the annihilation step, which represents the "survival of the fittest" concept. At the end of the inner loop there is a total of P + M chromosomes, where P are the parents and M are the newly generated children. Out of the P + M chromosomes, the top P chromosomes with respect to the fitness function are selected and passed on to the next generation; the remaining chromosomes are discarded. The outer loop is repeated for a given number of generations.
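The flow described above can be sketched end-to-end. This sketch uses a toy bit-counting fitness, single-point rather than 3-point crossover, and small parameter values, so it illustrates the loop structure (selection, crossover, mutation, annihilation) rather than the actual GA used in the later chapters.

```python
import random

random.seed(42)

GENES, POP, CHILDREN, GENERATIONS = 16, 10, 10, 30

def fitness(chrom):
    # Toy objective (count of 1-bits); stands in for the application-
    # specific objective and fitness computation.
    return sum(chrom)

def roulette(pop):
    """Roulette wheel selection: pick proportionally to fitness."""
    total = sum(fitness(c) for c in pop)
    r, acc = random.uniform(0, total), 0.0
    for c in pop:
        acc += fitness(c)
        if acc >= r:
            return c
    return pop[-1]

def crossover(p1, p2):
    # Single-point crossover for brevity (the text describes 3-point).
    cut = random.randrange(1, GENES)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(chrom, rate=0.05):
    return [g ^ 1 if random.random() < rate else g for g in chrom]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):                 # outer loop: generations
    children = []
    while len(children) < CHILDREN:          # inner loop: make M children
        c1, c2 = crossover(roulette(pop), roulette(pop))
        children += [mutate(c1), mutate(c2)]
    # Annihilation: keep the fittest P of the P + M chromosomes.
    pop = sorted(pop + children, key=fitness, reverse=True)[:POP]

print(fitness(pop[0]))
```

Because annihilation never discards the current best chromosome, the best fitness is monotonically non-decreasing across generations.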
2.5 Multi-objective Multiple Design Points
Platform based design is a way to address the complexity of embedded system design
under tight deadlines. It is common to build systems around the same computational
platform which includes a microprocessor or microcontroller for running the operating
system and a DSP for running media-related applications. The same platform will there-
fore need to cater to different application characteristics. The OMAP platform from
Texas Instruments comes in several flavors to address the market diversity. Similarly,
Texas Instruments offers two variants of the C55X DSP: the C5510 (with 320KB of SRAM and 64KB of DARAM) for high-end applications and the C5503 (with 64KB of SRAM and 64KB of DARAM) for mid-range applications. As a consequence, the platform designer is not just interested
in a single optimal design point but in a set of design points. This set of design points is termed the non-dominated set [30], as no design point in it is better than any other non-dominated point on all objective criteria. These non-dominated points form the Pareto-optimal set. The condition of Pareto optimality [30] is mathematically defined as follows. Let a vector x be partially less than y, written x <p y, when the following conditions hold:
(x <p y) ⟺ (∀i)(xi ≤ yi) ∧ (∃i)(xi < yi)
Using the partial relation <p, we say that if x <p y then x dominates y, or y is a dominated point. If the set of all dominated points is removed from the set of all points in the design space, we obtain the non-dominated set, i.e., the Pareto-optimal design points. In other words, the complement of the dominated set with respect to the design space gives the Pareto-optimal set.
Each non-dominated point is an optimal design point with a specific price-performance factor. Thus, platform based design requires that a set of non-dominated design points be computed automatically and in reasonable computation time.
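The dominance test and the filtering step can be written down directly from the definition above (all objectives minimized here); the design points are hypothetical (area, memory-stall) pairs, not results from this work.

```python
def dominates(x, y):
    """x partially-less-than y: x is no worse in every objective and
    strictly better in at least one (all objectives minimized)."""
    return (all(a <= b for a, b in zip(x, y))
            and any(a < b for a, b in zip(x, y)))

def pareto_front(points):
    """Remove every dominated point; the complement is the Pareto set."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (area, stalls) design points from an exploration run.
designs = [(10, 500), (12, 300), (15, 300), (20, 100), (11, 450)]
print(sorted(pareto_front(designs)))
```

Here (15, 300) is dropped because (12, 300) is no worse in both objectives and strictly better in area; the four survivors each trade area against stalls.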
Chapter 3
Data Layout for Embedded Applications
3.1 Introduction
As discussed in Chapter 1, embedded applications are highly performance critical, and the processor resources need to be utilized optimally to extract maximum performance. One of the most critical steps in the embedded application development flow is system integration, where all the software modules are integrated and mapped to a given target memory
architecture. This step has large performance implications depending on how the memory architecture is used. The memory architecture of embedded DSPs is heterogeneous and contains memories with different access times. For example, an embedded system may
contain on-chip and off-chip memory modules with different access times, single and dual
ported memory, and multiple memory banks to support many simultaneous accesses.
During system integration, the decision to map critical data onto faster memories and non-critical data onto slower memories is made. But it is not easy to classify the data into critical and non-critical for the following reasons: (a) typically 70 to 80% of the code and data is legacy and may not have a clear specification, (b) most of the code in embedded DSPs is developed in assembly and hence compiler based analysis is not possible [49], and (c) because of faster time-to-market constraints many of the software
modules are procured as IPs from software vendors.
The need for integrating multiple IP modules as part of embedded application de-
velopment was discussed in Section 1.3. Typically the IP modules are optimized on a
stand-alone embedded processor with a generic memory architecture. However, when
these IPs get integrated as part of the system, the memory architecture may be differ-
ent from the original target platform on which the performance of the IP module was
characterized. Hence, during system integration, the integrated module may not
perform at the level quoted in the specification given by the IP vendor. Since
the IPs/modules are independently optimized, the integrator is under pressure to deliver
the complete product with each component performing at the same level as it did independently.
This is a big challenge. When a module is optimized independently, its
developer has the whole SoC's resources (MIPS/memory) available to optimize the module.
The application integrator, however, has to consider all the modules, so the
system resources have to be shared among the different components. Hence the application
integrator needs to know each module's MIPS and memory requirements
unambiguously in order to allocate the shared resources to the most critical needs. Usually the
module’s memory requirements are given only at a high level. To be able to optimize
the whole application/system, the integrator needs detailed memory analysis at the
module level, such as which data buffers need to be placed in dual-ported memories and
which data buffers should not be placed in the same memory bank. This data is usually
not available. Further, the critical code is usually written in low-level assembly language
to meet real-time constraints and/or due to legacy reasons. In order to obtain good performance
and a reduction in memory stalls, the data buffers of the application need to be
placed carefully in the different types of memory; this is known as the data section layout
problem. Typically, data section layout is performed manually. For the above
reasons, application integration/optimization takes a significant amount of
time (approximately 1-2 man-months) analyzing the application and mapping software
modules to the custom memory architecture in order to obtain optimal cost and performance.
In summary there are two issues: (1) the time taken is significant, which is not acceptable
for current day time to market requirements, (2) quality of solution varies based on the
expertise.
The data layout optimization methods [10, 40, 44, 53, 58] vary significantly between applications
built for microcontrollers (MCUs) and those built for Digital Signal Processors (DSPs), for
the following reasons: (a) DSP applications are more data dominated than the control
software executed on MCUs. Memory bandwidth requirements for DSP applications range
from 2 to 3 memory accesses per processor clock cycle, while an MCU at best needs
only one memory access per cycle. (b) The DSP software for critical kernels is developed
mostly as hand-optimized assembly code, whereas the MCU software is developed in high-level
languages. Hence compiler-based optimizations may not be directly applicable to
the DSP kernels.
In this chapter we address the data layout problem for DSP memory architectures
using different methods. First, we formulate data section layout as an Integer Linear
Programming (ILP) problem. The proposed ILP formulation can handle: (i) on-chip and
off-chip memory, (ii) multiple on-chip memory banks, (iii) single and dual access RAMs,
(iv) overlay of data sections with non-overlapping life times, and (v) swapping of data
(from/to off-chip memory). The main contribution of this work is the development of
a simple unified ILP formulation. The formulation can optimize performance or cost,
although in our work we concentrate on performance. We have developed a framework
which automatically generates the ILP formulation for an embedded application. The
ILP formulation is solved using a public domain LP solver, viz., lp_solve.
The ILP-based approach is very effective for many moderately complex test cases and
delivers optimal results. However, as the application complexity increases, the execution
time of the ILP method becomes an issue: for some of the test cases, the run-time is more
than 24 hours, and in some cases the ILP does not yield a valid solution even after
running for 30 hours. Hence, we also formulate the data layout problem as a Genetic
Algorithm (GA). Finally, since the data layout problem is the kernel of the memory
architecture exploration problem and needs to be invoked several times, we looked
at developing faster methods to solve it. In this chapter we also propose a
heuristic algorithm that maps the data sections to the given memory architecture and
reduces the number of memory access conflicts (both self conflicts and parallel conflicts).
We compare the results of the heuristic, GA and ILP.
The rest of this chapter is organized as follows. The following section deals with the
necessary background and the problem statement. The ILP formulation is presented in
Section 3.3. In Section 3.4 we present the Genetic Algorithm Formulation of the data
layout problem. The Greedy back-tracking heuristic is discussed in Section 3.5. We
report the experimental results in Section 3.6. In Section 3.7 we discuss the related work.
Finally concluding remarks are presented in Section 3.8.
3.2 Method Overview and Problem Statement
3.2.1 Method Overview
Figure 3.1 explains the data layout method in a block diagram. Initially, the application's
data is grouped into logical sections. This is done to reduce the number of individual
items and thereby reduce the complexity. This step is important because, once the data is
grouped into a section, the section can only be assigned a single location, and all the data
variables inside the section are placed contiguously starting from the given memory
address. The order of data placement within a section can be arbitrary and generally does not
affect the performance. Note that a section cannot contain both code and data. There is
a trade-off in combining different variables into a section. If too many data variables are
combined into one section, then the flexibility of placement in memory is reduced.
On the other hand, if each data variable is mapped to its own section, then
there are too many sections to handle, which increases the data layout complexity. In
practice, an embedded development engineer makes a judicious choice when mapping
a set of data variables into a section. Typically, each large data array is mapped
into an individual section, and all scalar data variables belonging to a module are mapped
into one section. Note that this process is performed manually.
Once the grouping of data into sections is done, the code is compiled and executed
Figure 3.1: Overview of Data Layout
for (i=0; i<n; i++) y[i] = b[i] + a[i] * a[i-1];
Figure 3.2: Illustration of Parallel and Self Conflicts
in a cycle-accurate software simulator. From the software simulator, profile data (access
frequencies) for the data sections is obtained. In addition, the simulator generates a conflict
matrix that represents the parallel and self conflicts. Parallel conflicts refer to simultaneous
accesses of two different data sections, while self conflicts refer to simultaneous
accesses of the same data section. Consider the code segment in Figure 3.2.
In this code segment, data sections a and b need to be accessed together and therefore
represent a parallel conflict. The accesses to a[i] and a[i-1] are a self conflict. If the
arrays a and b are placed in different memory banks, or in a memory bank with multiple ports,
then these accesses can be made concurrently without incurring additional stall cycles.
However, note that the data array a, which has a self conflict, must be placed in a memory
bank with multiple ports to avoid additional stall cycles.
The conflict relations among data sections are represented by an n×n matrix, where n
is the number of data sections. The (i, j)th element represents the conflicts, or concurrent
accesses, between data sections i and j. The diagonal elements represent self conflicts. The
conflict matrix is symmetric.
As an example, consider an application with 4 data sections: a, b, c and d. A conflict
matrix is shown below, where the indices i and j are ordered as a, b, c and d. Section a
conflicts with itself and sections b and d. In this matrix, more specifically, a conflicts with
itself 100 times, while it conflicts with b and d 40 and 2000 times respectively. The sums
of all the conflicts for data sections a, b, c, and d are 2140, 540, 650 and 2050 respectively.
Hence the sorted order of the data sections in terms of total conflicts is a, d, c, b.
C =
    |  100    40     0  2000 |
    |   40   500     0     0 |
    |    0     0   600    50 |
    | 2000     0    50     0 |
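The per-section totals and the resulting placement order can be computed mechanically from the conflict matrix. The following sketch reproduces the example above (section names a, b, c, d follow the text):

```python
# Sketch: computing per-section total conflicts from the example conflict
# matrix and ordering sections by total conflicts. The matrix is symmetric
# and the diagonal holds self conflicts.
C = [[ 100,   40,   0, 2000],
     [  40,  500,   0,    0],
     [   0,    0, 600,   50],
     [2000,    0,  50,    0]]
names = ['a', 'b', 'c', 'd']

totals = {names[i]: sum(C[i]) for i in range(len(names))}
print(totals)  # {'a': 2140, 'b': 540, 'c': 650, 'd': 2050}

order = sorted(names, key=lambda s: totals[s], reverse=True)
print(order)   # ['a', 'd', 'c', 'b']
```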
Data section sizes, the access frequencies of the data sections, the conflict matrix and the memory
architecture are given as inputs to data layout. The objective of data layout is to use
the memory architecture efficiently by placing the most critical data sections in on-chip RAM
and to reduce bank conflicts by placing conflicting data in different memory banks. Data
layout assigns memory addresses for all the data sections.
3.2.2 Problem Statement
Consider a memory architecture M with m on-chip SARAM memory banks, n on-chip
DARAM memory banks, and an off-chip memory. The size of each of the on-chip memory
bank and the off-chip memory is fixed. The access time for the on-chip memory banks
is one cycle, while that for the off-chip memory is l cycles. Given an application with d
sections, the simultaneous access requirement of multiple arrays is captured by means of a
two-dimensional matrix C where Cij represents the number of times data sections i and j
are accessed together in the same cycle in the execution of the entire program. We do not
consider more than two simultaneous accesses, as the embedded core typically supports
up to two accesses in a single cycle. If data sections i and j are placed in two different
memory banks, then these conflicting accesses can be satisfied simultaneously without
incurring stall cycles. Cii represents the number of times two accesses to data section i
are made in the same cycle. Self-conflicting data sections need to be placed in DARAM
memory banks, if available, to avoid stalls. The objective of the data layout problem is
to place the data sections in memory modules such that the following are minimized:
• Number of memory stalls incurred due to conflicting accesses of data sections placed
in the same memory bank
• Self-conflicting accesses placed in SARAM banks
• Number of off-chip memory accesses.
Note that the sum of the sizes of the data sections placed in a memory bank cannot
exceed the size of the memory bank.
3.3 ILP Formulation
In this section we present our data layout formulation in a step-by-step manner. We
start with the simplest problem and include the different optimizations one by one. The
following are the different steps:
1. Basic formulation – placing the data and code in memory considering only the
on-chip and off-chip memory
2. Modeling multiple on-chip memory banks
3. Handling Single and Dual Access RAMs
4. Overlay of data sections with non-overlapping life times
5. Swapping of code and data (from/to external memory)
The ILP formulation for the optimal data section layout problem, requires a number
of application related parameters. We describe these first. An embedded application is
composed of M modules. Let NDj represent the number of data sections in module j.
Table 3.1: List of Symbols Used

Application Parameters (Constants)
  M      Number of application modules
  SDjs   Size of data section s in module j
  ADjs   Access count of data section s in module j
  NDj    Number of data sections in module j
  Bjst   Number of simultaneous accesses to data sections s and t in module j
  SBkj   Memory size required to account for scratch buffers in module j placed in bank k
  SWkj   Memory size required to account for swapped code or data in module j placed in bank k
  Sjs    1 if data section s of module j is a scratch buffer, 0 otherwise

Application Parameters (Variables)
  IDjs   0-1 variable to indicate if data section s in module j is placed in on-chip memory
  IDkjs  0-1 variable to indicate if data section s in module j is placed in the kth internal bank
  EDjs   0-1 variable to indicate if data section s of module j is placed in the external memory
  Zkjst  0-1 variable to indicate if data sections s and t of module j are both placed in the
         kth internal memory bank

Architecture Parameters (Constants)
  SMi    Size of on-chip memory
  SMe    Size of off-chip memory
  We     Number of cycles for an off-chip memory access
  Nb     Number of internal memory banks
  SMik   Size of the kth internal memory bank
  DPk    1 if the kth internal bank is DARAM, 0 otherwise
As mentioned earlier, in our discussion, we follow the convention that each data section
refers to a single array. Let the size of data section s in module j be denoted by SDjs.
The access count for a data section is denoted by ADjs. During memory layout, a data
section occupies a block of contiguous memory locations. The value for some of the above
parameters, e.g., the access counts, can be obtained by profiling the application. In our
framework the profile data is collected using an Instruction Level Simulator.
For the ILP formulation we also require memory architecture parameters. The size of
internal (on-chip) and external (off-chip) memory are denoted by SMi and SMe respec-
tively. We also need the number of stall cycles We for each access to external memory.
Table 3.1 summarizes the list of symbols used in our formulation. With this, we are ready
to describe the basic formulation.
3.3.1 Basic Formulation
As mentioned earlier we formulate the optimal performance problem in terms of the
number of memory stall cycles. A data section s in module j placed in external memory
incurs ADjs ∗ We stall cycles. To indicate whether a data section is placed in internal
or external memory, we use a 0-1 integer variable IDjs; IDjs is 1 if the data section is
placed in on-chip memory and 0 otherwise. Thus the number of stall cycles due to data
section s is ADjs · We · (1 − IDjs). The objective of the formulation is to minimize the
total number of memory stalls. That is,
    min  Σ_{j=1}^{M} Σ_{s=1}^{NDj} ADjs · We · (1 − IDjs)        (3.1)
Next we specify the memory constraints. Equations (3.2) and (3.3) enforce the con-
straint that the total size of the code and data sections that are placed in the external and
internal memory do not exceed the available external and internal memory respectively.
    Σ_{j=1}^{M} Σ_{s=1}^{NDj} SDjs · (1 − IDjs) ≤ SMe        (3.2)

    Σ_{j=1}^{M} Σ_{s=1}^{NDj} SDjs · IDjs ≤ SMi        (3.3)
Lastly, we add the constraint that IDjs are 0-1 integer variables.
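For a small instance, the basic formulation can be checked by exhaustive enumeration of the 0-1 variables rather than by an LP solver. The sketch below uses hypothetical sizes, access counts, and capacities; a real flow would hand the same model to lp_solve.

```python
from itertools import product

# Sketch: the basic formulation (3.1)-(3.3) solved by exhaustive search
# over the 0-1 variables IDjs (1 = on-chip) for a tiny hypothetical
# instance. A section left off-chip costs AD * We stall cycles; the
# capacity constraints of both memories must hold.
SD = [400, 300, 500, 200]   # section sizes (words), hypothetical
AD = [900, 100, 700, 300]   # access counts, hypothetical
We = 4                      # off-chip wait states
SMi, SMe = 800, 10000       # on-chip / off-chip capacities

best = None
for ID in product([0, 1], repeat=len(SD)):
    if sum(s * i for s, i in zip(SD, ID)) > SMi:
        continue  # violates on-chip capacity (3.3)
    if sum(s * (1 - i) for s, i in zip(SD, ID)) > SMe:
        continue  # violates off-chip capacity (3.2)
    stalls = sum(a * We * (1 - i) for a, i in zip(AD, ID))  # objective (3.1)
    if best is None or stalls < best[0]:
        best = (stalls, ID)

print(best)  # (3200, (1, 0, 0, 1)): sections 0 and 3 go on-chip
```

Note that the optimum is not simply "largest access counts first": section 2 has the second-highest access count but does not fit alongside section 0, so the solver picks sections 0 and 3.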
3.3.2 Handling Multiple Memory Banks
Embedded DSP applications are data intensive; typically two to three data sections are
accessed simultaneously in a cycle. DSP processors are designed to handle multiple data
accesses. DSPs have internal memory with multiple banks and multiple internal data
buses. The data variables that are accessed simultaneously need to be placed in different
memory banks to avoid memory stalls.
This section handles the partitioning of concurrently accessed data arrays across multiple
memory banks to avoid additional stalls. Only two simultaneous data accesses are
considered, but this can be extended easily to more than two accesses. To represent the
number of simultaneous accesses to two different data sections, in a module j, we use a
2-dimensional matrix Bj which is of size NDj × NDj. An element of this matrix, Bjst, for
s ≠ t, represents the number of simultaneous accesses to data sections s and t. Note that
Bjst refers to the total number of simultaneous accesses to the different elements of data
sections s and t. For example, if two data sections s1 and s2, each of size 100 elements,
are accessed simultaneously (as in s1[i]+s2[i]), then Bjs1s2 = 100.
Bjss refers to the number of simultaneous accesses to the same data section s. We will
consider this in our formulation in the next subsection. For the time being, we will assume
Bjss = 0 for all data section s and for all modules j. The elements of the Bj matrix are
fixed (constants), and can be obtained by profiling the application.
Let us assume that the internal memory consists of Nb banks, and that the size of the kth
memory bank is SMik. The total size of the internal memory is

    SMi = Σ_{k=1}^{Nb} SMik
Further, let IDkjs represent whether data section s of module j resides in the kth internal
bank. Lastly, we use a (derived) 0-1 variable Zkjst to represent whether data sections s
and t of module j are both placed in internal bank k. Zkjst is 1 if and only if IDkjs = 1
and IDkjt = 1. This can be expressed by the linear inequality

    Zkjst ≥ IDkjs + IDkjt − 1        (3.4)
Note that Zkjss = 1 for all sections. We replace Equations (3.2) and (3.3) in the basic
formulation with:

    Σ_{j=1}^{M} Σ_{s=1}^{NDj} SDjs · EDjs ≤ SMe        (3.5)

    Σ_{j=1}^{M} Σ_{s=1}^{NDj} SDjs · IDkjs ≤ SMik        (3.6)
where EDjs is 1 if the sth data section of module j resides in off-chip memory. These
variables can be expressed in terms of IDkjs as:

    EDjs = 1 − Σ_{k=1}^{Nb} IDkjs        (3.7)
Note that inequality (3.6) must hold for all k from 1 to Nb, the number of internal memory
banks. To enforce that each data section resides in at most one internal memory bank, we
add the constraint:

    Σ_{k=1}^{Nb} IDkjs ≤ 1        (3.8)
Inequalities (3.8) must hold for all relevant values of j and s.
Lastly, the objective function in this formulation also accounts for the stalls incurred
by not placing sections s and t in different memory banks. The second term in the
following objective function accounts for this, while the first term accounts for the stall
cycles due to external memory accesses of data sections. Thus the formulation is:

    min ( Σ_{j=1}^{M} Σ_{s=1}^{NDj} ADjs · We · EDjs
          + Σ_{j=1}^{M} Σ_{k=1}^{Nb} Σ_{s=1}^{NDj} Σ_{t=s+1}^{NDj} Bjst · Zkjst )        (3.9)
subject to constraints (3.4) to (3.8). Note that the first term includes all accesses (simul-
taneous and non-simultaneous accesses) to data section s, while the second term excludes
non-simultaneous accesses.
3.3.3 Handling SARAM and DARAM
In this formulation we account for the cost of simultaneous accesses to the same data sec-
tion. Let Bjss denote the number of such accesses. These accesses will incur an additional
stall cycle if the data section s does not reside in a memory bank that supports dual
access. Likewise, a simultaneous access to data sections s and t will incur an additional
stall cycle when they both are in memory bank k which is single ported. Let DPk = 1
denote that memory bank k is dual ported and DPk = 0 otherwise. Note that for a given
memory architecture DPk is a constant (0 or 1) and known a priori.
    min ( Σ_{j=1}^{M} Σ_{s=1}^{NDj} ADjs · We · EDjs
          + Σ_{j=1}^{M} Σ_{k=1}^{Nb} Σ_{s=1}^{NDj} Σ_{t=s}^{NDj} Bjst · (1 − DPk) · Zkjst )        (3.10)
subject to constraints (3.4) to (3.8).
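The objective (3.10) can be evaluated directly for any candidate placement, which is useful for sanity-checking a solver's output. The sketch below flattens modules away for brevity; the access counts, conflict matrix, and bank configuration are hypothetical.

```python
# Sketch: evaluating objective (3.10) for a given placement. place[s] is
# the bank of section s (0 = off-chip, 1..Nb = internal banks); DP[k] is
# 1 for DARAM banks. B[s][t] counts simultaneous accesses, with the
# diagonal holding self conflicts. All numbers are hypothetical.
def layout_cost(place, AD, We, B, DP):
    n = len(place)
    # First term: every access to an off-chip section costs We stall cycles.
    cost = sum(AD[s] * We for s in range(n) if place[s] == 0)
    # Second term: conflicting accesses mapped to the same single-ported
    # internal bank (t == s covers self conflicts in a SARAM bank).
    for s in range(n):
        for t in range(s, n):
            k = place[s]
            if k != 0 and k == place[t] and DP[k] == 0:
                cost += B[s][t]
    return cost

AD = [100, 80, 60]                      # hypothetical access counts
We = 4                                  # off-chip wait states
B = [[10, 30, 0], [30, 0, 5], [0, 5, 0]]
DP = {1: 1, 2: 0}                       # bank 1 DARAM, bank 2 SARAM

print(layout_cost((1, 2, 2), AD, We, B, DP))  # 5: sections 1,2 conflict in bank 2
print(layout_cost((0, 1, 1), AD, We, B, DP))  # 400: section 0 off-chip
```

In the first placement, section 0's self conflicts (B[0][0] = 10) cost nothing because bank 1 is dual-ported, leaving only the 5 parallel conflicts between sections 1 and 2 in the single-ported bank 2.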
3.3.4 Overlay of Data Sections
As mentioned earlier data sections that have non-overlapping life-times can share the same
on-chip memory space. These arrays are commonly referred to as scratch buffers. In our
discussion, we assume that scratch buffers are identified by the application developer. Let
Sjs = 1 denote that data section s is a scratch buffer; Sjs = 0 otherwise. The memory
used by a scratch buffer can be reused across different modules, but not within the same
module.
We account for the internal memory required for the scratch buffers in the following
way. For each module j, we compute SBkj, the sum of the sizes of the scratch buffers
in module j that are stored in the kth internal memory bank. The memory required for
scratch buffers in the kth internal bank corresponds to the maximum of SBkj over all
modules. That is:
    SBk = max_j Σ_{s=1}^{NDj} SDjs · IDkjs · Sjs        (3.11)
Further, the individual memory requirement of each scratch buffer stored in the
kth internal memory bank can be excluded from the internal memory constraint (Inequality
(3.6)). Thus Inequality (3.6) is replaced by

    Σ_{j=1}^{M} Σ_{s=1}^{NDj} SDjs · (1 − Sjs) · IDkjs + SBk ≤ SMik        (3.12)
The constraint for external memory remains the same (Inequality (3.5)). Thus the ILP
formulation in this case has the same objective function (Equation (3.10)), subject to
constraints (3.4), (3.5)–(3.8), (3.11), and (3.12).
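The scratch-buffer accounting of Equations (3.11) and (3.12) can be sketched as a per-bank usage computation: non-scratch sections are charged individually, while scratch sections are charged only through the per-module maximum SBk, since scratch memory is reused across modules. The section records below are a hypothetical representation.

```python
# Sketch: scratch-buffer accounting for bank k per Equations (3.11)-(3.12).
# Non-scratch sections are charged individually; scratch buffers are
# charged via SBk, the maximum over modules of the scratch size that a
# module places in bank k (scratch memory is reused across modules).
def bank_usage(sections, k):
    """sections: dicts with 'size', 'bank', 'scratch', 'module' (hypothetical)."""
    per_module = {}
    fixed = 0
    for sec in sections:
        if sec['bank'] != k:
            continue
        if sec['scratch']:
            m = sec['module']
            per_module[m] = per_module.get(m, 0) + sec['size']
        else:
            fixed += sec['size']
    SBk = max(per_module.values(), default=0)  # Equation (3.11) for bank k
    return fixed + SBk                         # left-hand side of (3.12)

sections = [
    {'size': 200, 'bank': 1, 'scratch': False, 'module': 1},
    {'size': 100, 'bank': 1, 'scratch': True,  'module': 1},
    {'size': 50,  'bank': 1, 'scratch': True,  'module': 1},
    {'size': 120, 'bank': 1, 'scratch': True,  'module': 2},
]
print(bank_usage(sections, 1))  # 350 = 200 fixed + max(150, 120) scratch
```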
3.3.5 Swapping of Data
Swapping of a data section is generally applied in embedded DSP systems that have a
large external memory and a very small internal memory. Here, the data that is identified
for swapping resides in external memory and is copied into the internal memory (on-chip
RAM) only for the duration of execution/access of a section. A data section is identified
for swapping by carefully weighing the swapping cost against the performance benefit that
results from accessing the section from internal memory. To model swapping, we assume
that one common swap memory space SWk is allocated in the kth internal memory bank. The
size of SWk is the maximum of the total size of all swapped sections in a module, where
the maximum is taken across all modules. The formulation for swapping proceeds in a
manner similar to that for scratch buffers, where swapped sections share the same memory area in
the on-chip memory bank. Additionally, we have to account for the off-chip requirement of
all swapped sections (Σ_{k=1}^{Nb} SWk). Lastly, the objective function should account for the
cost of swapping.
3.4 Genetic Algorithm Formulation
Genetic Algorithms (GAs) have been used to solve hard optimization problems [30]. Genetic
algorithms simulate the natural process of evolution using genetic operators such
as natural selection, survival of the fittest, mutation and crossover in order to search the
solution space.
To map an optimization problem to the GA framework, we need the following: chro-
mosomal representation, fitness computation, selection function, genetic operators, the
creation of the initial population and the termination criteria.
For the memory layout problem, each chromosome should represent a memory
placement. A chromosome is a vector of d elements, where d is the number of data
sections. Each element of a chromosome can take a value in (0 .. m), where 1..m represent
on-chip memory banks (including both SARAM and DARAM memory banks) and 0
represents off-chip memory. Thus if element i of a chromosome has the value k, then the
ith data section is placed in memory bank k, and a chromosome represents a memory
placement for all data sections. Note that a chromosome may not always represent a
valid memory placement, as the size of the data sections placed in a memory bank k may
exceed the size of k. Thus the genetic algorithm should consider only valid chromosomes
for evolution. This is achieved by giving a low fitness value for invalid chromosomes. Our
initial experiments demonstrated that the above chromosome representation (vector of
decimal numbers) is more effective than the conventional bit vector representation [30]
as the latter will lead to assignment of non-existent memory banks when the number of
memory banks is not a power of 2.
Genetic operators provide the basic search mechanism by creating new solutions based
on the solutions that exist. The selection of the individuals to produce successive generations
plays an extremely important role. The selection approach assigns a probability of
selection to each individual, depending on its fitness. An individual with a higher fitness
has a higher probability of contributing one or more offspring to the next generation. In
the selection process a given individual can be chosen more than once. Let us denote the
size of the population (number of individuals) as P . Reproduction is the operation of pro-
ducing offspring for the next generation. This is an iterative process. In every generation,
from the P individuals of the current generation, M more offspring are generated. This
results in a total population of P + M . From this total population of P + M , P fittest
individuals survive to the next generation. The remaining M individuals are annihilated.
In our data layout problem, for each individual, the fitness function computes the
number of resulting memory conflicts. Since GAs typically solve a maximization problem,
we convert our problem into a maximization problem by negation and normalization. Recall
that a chromosome may represent an invalid solution. To discourage invalid individuals,
we associate a very low fitness value to them.
The Crossover operator takes two individuals and produces two new individuals by
merging the characteristics of the two parents at a random point (named crossover site).
Mutation is applied after crossover to each individual with a given probability. Mutation
changes an individual to produce a new one by changing some of its genes. Lastly, the
GA must be provided with an initial population that is created randomly. GAs move
from generation to generation until a pre-determined number of generations has been produced or
the change in the best fitness value falls below a certain threshold. In our implementation we
have used a fixed number of generations as the termination criterion.
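The pieces described above — placement chromosomes, penalty-based fitness for invalid individuals, selection, single-point crossover, mutation, and elitist survival of the fittest — can be sketched as follows. All sizes, capacities, access counts, and the simplified cost model (every internal bank treated as single-ported, minimizing stalls directly rather than maximizing a negated fitness) are hypothetical.

```python
import random

# Sketch of the GA formulation. A chromosome is a vector of bank ids
# (0 = off-chip, 1..m = on-chip banks). Invalid placements (bank capacity
# exceeded) get an infinite cost, i.e. the lowest possible fitness.
def stalls(chrom, sizes, cap, AD, We, B):
    for k in cap:
        if sum(sz for sz, b in zip(sizes, chrom) if b == k) > cap[k]:
            return float('inf')  # invalid chromosome
    # Off-chip stalls plus conflicts between sections sharing a bank.
    cost = sum(AD[s] * We for s, b in enumerate(chrom) if b == 0)
    cost += sum(B[s][t] for s in range(len(chrom))
                for t in range(s, len(chrom))
                if chrom[s] != 0 and chrom[s] == chrom[t])
    return cost

def ga(sizes, cap, AD, We, B, pop=20, gens=40, seed=1):
    rng = random.Random(seed)
    m = max(cap)
    # Seed the population with the trivial all-off-chip placement so at
    # least one valid individual always exists.
    P = [[0] * len(sizes)] + \
        [[rng.randint(0, m) for _ in sizes] for _ in range(pop - 1)]
    for _ in range(gens):
        kids = []
        for _ in range(pop):
            a, b = rng.sample(P, 2)             # selection
            cut = rng.randrange(1, len(sizes))  # single-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:              # mutation
                child[rng.randrange(len(sizes))] = rng.randint(0, m)
            kids.append(child)
        # Survival of the fittest: keep the best pop of the P + M individuals.
        P = sorted(P + kids, key=lambda c: stalls(c, sizes, cap, AD, We, B))[:pop]
    return P[0]

sizes = [400, 300, 500, 200]   # section sizes (words), hypothetical
cap = {1: 800, 2: 600}         # bank id -> capacity
AD = [900, 100, 700, 300]      # access counts
We = 4                         # off-chip wait states
B = [[0, 50, 0, 0], [50, 0, 0, 0],
     [0, 0, 0, 0], [0, 0, 0, 0]]  # conflict matrix

best = ga(sizes, cap, AD, We, B)
print(best, stalls(best, sizes, cap, AD, We, B))
```

Because survival is elitist, the best placement found never gets worse across generations, and it is always at least as good as the all-off-chip seed.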
We have also developed a simulated annealing (SA) approach for the data layout
problem and experimented with it on some of the applications. The performance of
SA is comparable to that of GA; however, SA takes more time to arrive at a solution.
Hence we did not consider the SA approach for data layout further in this thesis.
3.5 Heuristic Algorithm
As mentioned earlier, the data layout problem is NP-complete. Further, the ILP and
GA methods described in the previous sections consume significant run-time to arrive
at a solution, and these methods are suitable only for obtaining an optimal data layout
for a fixed memory architecture. But to perform memory architecture exploration, which
is addressed in the following chapters, data layout needs to be performed for
thousands of memory architectures, so a fast heuristic method for data layout is critical.
Using an exact method such as Integer Linear Programming (ILP), or an evolutionary
approach such as GA or SA, which can take as much as 20 to 25 minutes
of computation time for each data layout problem, may be prohibitively expensive for the
memory architecture exploration problem. Hence in this section we propose a 3-step
heuristic method for data placement.
3.5.1 Data Partitioning into Internal and External Memory
The first step in data layout is to identify and place all the frequently accessed data
sections in the internal memory. Data sections are sorted in descending order of frequency
per byte (FPBi), defined as the ratio of the number of accesses to the size of the data section.
Based on this sorted order, data sections are greedily identified for placement in internal
memory while free space is available. We refer to all the on-chip memory banks together
as internal memory. Note that the data sections are not placed at this point but only
identified for internal memory placement. The actual placement decisions are made later,
as explained below.
Once all the data sections to be placed in internal memory are identified, the remaining
sections are placed in external memory. The cost of placing data section i in external
memory is computed by multiplying the access frequency of data i with the wait-states
of external memory. The placement cost is computed for all the data sections placed in
the external memory.
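This first step can be sketched as a greedy partition: sort by frequency per byte and mark sections for internal memory while they fit. The sizes, access counts, and capacity below are hypothetical.

```python
# Sketch of step 1: greedy partition by frequency per byte (FPB = accesses
# divided by size), marking sections for internal memory until the total
# on-chip capacity is exhausted. Numbers are hypothetical.
def partition(sizes, accesses, internal_capacity):
    order = sorted(range(len(sizes)),
                   key=lambda i: accesses[i] / sizes[i], reverse=True)
    internal, free = [], internal_capacity
    for i in order:
        if sizes[i] <= free:       # greedily take the densest sections
            internal.append(i)
            free -= sizes[i]
    external = [i for i in range(len(sizes)) if i not in internal]
    return internal, external

print(partition([400, 300, 500, 200], [900, 100, 700, 300], 800))
# ([0, 3], [1, 2])
```

Sections 0 and 3 have the highest access density (2.25 and 1.5 accesses per word), so they are marked for internal memory; section 2, despite its high access count, no longer fits.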
3.5.2 DARAM and SARAM placements
The objective of the next two steps is to resolve as many conflicts (self-conflicts and
parallel conflicts) as possible by utilizing the DARAM memory and the multiple banks of SARAM.
Self-conflicts can only be avoided if the corresponding data section is placed in DARAM.
On the other hand, parallel conflicts can be avoided in two ways: either by placing the
conflicting data a and b in two different SARAM banks, or by placing the conflicting data
a and b in any DARAM bank. The former solution is attractive, as the SARAM area
cost is much lower than the DARAM area cost. Considering that self-conflicts can only
be avoided by placement in DARAM, and that the cost of DARAM is very high, data placement
decisions for DARAM need to be made very carefully. Also, many DSP applications
have large self-conflicting data, so DARAM placement is crucial for reducing
the run-time of an application.
The heuristic algorithm considers placement of data in DARAM as the first step in
internal memory placement. Data sections that are identified for placement in internal
memory are sorted by self-conflict per byte (SPBi), defined as the ratio of self
conflicts to the size of the data section. Data sections, in decreasing order of SPBi, are
placed in DARAM until all DARAM banks are exhausted. The cost of placing data
section i in DARAM is computed and added to the overall placement cost.
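The DARAM step can be sketched as a first-fit fill in SPB order. The candidate list, sizes, self-conflict counts, and bank capacities below are hypothetical, and the simple first-fit bank choice stands in for the cost-based choice described in the text.

```python
# Sketch of the DARAM step: sort internal-memory candidates by self
# conflicts per byte (SPB) and fill DARAM banks greedily (first fit).
# self_conf[i] is the diagonal entry C[i][i] of the conflict matrix;
# banks is a list of remaining DARAM bank capacities. Hypothetical values.
def daram_place(candidates, sizes, self_conf, banks):
    order = sorted(candidates,
                   key=lambda i: self_conf[i] / sizes[i], reverse=True)
    placed = {}
    for i in order:
        for k, free in enumerate(banks):
            if sizes[i] <= free:   # first DARAM bank with enough room
                banks[k] -= sizes[i]
                placed[i] = k
                break
    return placed

placed = daram_place([0, 1, 2], [100, 200, 50], [100, 500, 600], [256, 256])
print(placed)  # {2: 0, 1: 0, 0: 1}
```

Section 2 has by far the highest self-conflict density (12 conflicts per word) and is placed first; section 0 no longer fits in bank 0 and spills into bank 1.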
Once the DARAM data placements are complete, SARAM placement decisions are
made. Figure 3.3 explains the SARAM placement algorithm. Parallel conflicts between
data sections i and j can be resolved by placing conflicting data sections in different
SARAM banks. The SARAM placement starts by sorting all the data sections identified
for placement in internal memory by the total number of conflicts (TCi), i.e., the
sum of all conflicts of a data section with all other data sections, including self conflicts.
Note that all data sections, including the ones already placed in DARAM,
are considered while sorting for SARAM placement. This is because the data placement
in DARAM is only tentative and may be undone in the backtracking step if there is
a larger gain (i.e., more parallel conflicts resolved) in placing a data section i in DARAM
instead of one or more data sections that are already placed there.
During the SARAM placement step, if the data section under consideration is already
placed in DARAM, then it is ignored and the next data section in the sorted order is
considered for SARAM placement. The cost of placing data section i in
SARAM bank b is computed considering all the data sections already placed in the DARAM
and SARAM banks. The bank b that results in the minimum cost is chosen.
Next, the heuristic backtracks to find if there is any gain in placing data i in DARAM
Algorithm: SARAM Placement
1.  sort the data sections in data-in-internal-memory in descending order of TCi
2.  for each data section i in the sorted order do
3.    if data section i is already placed in DARAM
4.      continue with the next data section
5.    else compute min-cost: minimum of cost(i, b) for all SARAM banks
6.    endif
7.    find if there is potential gain in placing data i in DARAM
      by removing some of the already placed sections
8.    if there is potential gain in back-tracking
9.      identify the data-set-from-daram-to-be-removed
10.     find the alternate cost of placing the data-set-from-daram-to-be-removed in SARAM
11.     if alternate cost > min-cost(i)
12.       continue with the placement of data i in SARAM bank b
13.       update cost of placement: Mcyc = Mcyc + cost(i, b)
14.     else // there is gain in backtracking
15.       move the data-set-from-daram-to-be-removed to SARAM
16.       update cost of placement:
          Mcyc = Mcyc + cost(g, b), for all g in data-set-from-daram-to-be-removed
17.       place data i in DARAM and update cost of placement
18.     endif
19.   else no gain in backtracking, continue with the normal flow
20.     continue with the placement of data i in SARAM bank b
by removing some of the already placed data sections from DARAM. This is done by
considering the size of data section i and the minimum placement cost of data i in SARAM.
If there are one or more data sections in DARAM (call this set the daram-remove-set)
whose total size is more than the size of data section i, and the sum of self-conflicts of all
these data sections is less than the minimum placement cost of data i in SARAM, then
there is potentially a gain in placing data i into DARAM by removing
the daram-remove-set. Note that this is only a possibility and not a certain gain.
Once a daram-remove-set is identified, to ensure that there is a gain in the backtracking
step, the data sections in the daram-remove-set need to be tentatively placed in SARAM
banks and the minimum placement cost recomputed for each of them. If the sum of these
minimum placement costs is greater than the original minimum cost of placing data i in
SARAM bank b, then there is no gain in backtracking and data section i is placed in
SARAM. Otherwise there is a gain: the daram-remove-set is removed from DARAM and
placed in SARAM, data section i is placed in DARAM, and the overall placement cost is
updated. This process is repeated for all data sections identified for placement in the
internal memory.
The overall-placement cost gives the memory cycles (Mcyc) for placing application data
for a given memory architecture.
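The heuristic just described can be rendered as a small Python sketch. This is illustrative only: the data structures, the cost model (self-conflicts plus pairwise conflicts with co-resident sections), and the eviction order are simplifying assumptions, not the thesis implementation.

```python
def bank_cost(i, bank, conflict):
    # Cost of adding section i to a SARAM bank: i's self-conflicts plus
    # parallel conflicts with sections already resident in that bank.
    return conflict[i][i] + sum(conflict[i][j] for j in bank["placed"])

def best_saram(i, size, banks, conflict):
    # Cheapest SARAM bank with enough free space, or (None, inf).
    fits = [b for b in banks if b["free"] >= size[i]]
    if not fits:
        return None, float("inf")
    b = min(fits, key=lambda b: bank_cost(i, b, conflict))
    return b, bank_cost(i, b, conflict)

def place(i, bank, size):
    bank["placed"].append(i)
    bank["free"] -= size[i]

def saram_layout(sections, size, TC, conflict, daram, saram):
    """Greedy placement in descending order of total conflicts TC, with
    back-tracking into DARAM.  daram and saram banks are dicts with keys
    'free' and 'placed'.  Returns the accumulated placement cost Mcyc."""
    Mcyc, done = 0, set(daram["placed"])
    for i in sorted(sections, key=lambda s: -TC[s]):
        if i in done:           # already placed (e.g. pre-placed in DARAM)
            continue
        done.add(i)
        b, min_cost = best_saram(i, size, saram, conflict)
        # Tentative daram-remove-set: evict lowest self-conflict sections
        # from DARAM until section i would fit.
        evict, freed = [], 0
        for r in sorted(daram["placed"], key=lambda r: conflict[r][r]):
            if freed >= size[i]:
                break
            evict.append(r)
            freed += size[r]
        if (freed + daram["free"] >= size[i]
                and sum(conflict[r][r] for r in evict) < min_cost):
            # Potential gain: recompute the evicted sections' SARAM cost.
            alt = sum(best_saram(r, size, saram, conflict)[1] for r in evict)
            if alt < min_cost:  # certain gain: commit the back-track
                for r in evict:
                    daram["placed"].remove(r)
                    daram["free"] += size[r]
                    rb, rc = best_saram(r, size, saram, conflict)
                    place(r, rb, size)  # sketch: assumes the evictee fits
                    Mcyc += rc
                place(i, daram, size)
                continue
        if b is not None:       # normal flow: no gain in back-tracking
            place(i, b, size)
            Mcyc += min_cost
    return Mcyc
```

In this sketch, a section that fits nowhere on-chip is silently left for external memory; the thesis handles that case in the earlier identification of data-in-internal-memory.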
3.6 Experimental Methodology and Results
3.6.1 Experimental Methodology
In this section, we explain the methodology used in our experiments. For our experiments,
the main inputs are the access characteristics of the data sections. We need the sizes of
the data sections, access frequency of each of the data sections and the conflict matrix.
The access frequency and the conflict matrix are obtained from a software profiler. Since
the DSP applications typically have simple control flow, the profile information on the
access characteristics does not change very much from run to run.
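As an illustration of these inputs, the sketch below derives access frequencies and a conflict matrix from a hypothetical per-cycle access trace; the trace format is an assumption for this example, whereas the thesis obtains the same information from a profiler integrated with the simulator.

```python
from collections import Counter
from itertools import combinations

def profile_trace(trace, n_sections):
    """trace: list of cycles, each cycle being the list of data-section
    ids accessed in that cycle (hypothetical format).  Returns access
    frequencies and a conflict matrix: conflict[i][j] counts cycles in
    which sections i and j are accessed in parallel, and the diagonal
    counts self-conflicts (multiple accesses to one section per cycle)."""
    freq = Counter()
    conflict = [[0] * n_sections for _ in range(n_sections)]
    for cycle in trace:
        freq.update(cycle)
        distinct = set(cycle)
        for i, j in combinations(sorted(distinct), 2):
            conflict[i][j] += 1   # parallel conflict between i and j
            conflict[j][i] += 1
        for s in distinct:
            conflict[s][s] += cycle.count(s) - 1  # self-conflicts
    return freq, conflict
```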
Table 3.2: Memory Architecture for the Experiments
Bank Type   Number of Banks   Bank Size (Words)
DARAM       4                 4096
SARAM       2                 32768
We have used the Texas Instruments TMS320C55XX processor [76] for our experi-
ments. This processor has three 16-bit memory read buses and two 16-bit memory write
buses and has the capability to read three 16-bit data and write two 16-bit data in the
same clock cycle. The memory architecture of the 55X device is given in Table 3.2. Note
that the total memory size is 72 Kwords and is large enough to fit each of the instances
of all the four applications reported in Table 3.4.
We have used the Texas Instruments Code Composer Studio V2.2 [73] to run the
applications. Initially the applications are compiled with the CCS2.2 compiler with the
default memory placement made by the compiler. The compiled application is loaded
and simulated in the simulator to obtain the profile information and the conflict matrix,
which are inputs to the heuristic and the Genetic algorithms.
We have developed a framework which automatically generates the ILP formulation for
an embedded application. The input for the ILP formulation generator, specified in XML
format, are application parameters, memory configuration parameters, and the profile
data parameters. To obtain profile information, we developed a profiler and integrated it
with the C5510 Instruction Set Simulator (ISS) [73].
The ILP formulation is solved using lp_solve, a public domain LP solver [3].
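The formulation generator can be pictured with a toy model. The sketch below emits a deliberately simplified data-layout ILP (placement variables x_i_b, one-bank-per-section and bank-capacity constraints only; the thesis formulation additionally models conflicts and bank types) in lp_solve's LP file format.

```python
def make_lp(sizes, cost, cap):
    """Emit a toy data-layout ILP in lp_solve LP format.  x_i_b = 1 iff
    data section i is placed in bank b; cost[i][b] is a hypothetical
    per-placement cost standing in for the conflict-based objective."""
    n, m = len(sizes), len(cap)
    var = lambda i, b: f"x_{i}_{b}"
    lines = ["min: " + " + ".join(f"{cost[i][b]} {var(i, b)}"
                                  for i in range(n) for b in range(m)) + ";"]
    for i in range(n):   # each section placed in exactly one bank
        lines.append(" + ".join(var(i, b) for b in range(m)) + " = 1;")
    for b in range(m):   # bank capacity constraint
        lines.append(" + ".join(f"{sizes[i]} {var(i, b)}" for i in range(n))
                     + f" <= {cap[b]};")
    lines.append("bin " + ", ".join(var(i, b)
                                    for i in range(n) for b in range(m)) + ";")
    return "\n".join(lines)
```

The resulting text file can be fed directly to lp_solve; the real generator additionally reads the XML parameter files described above.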
3.6.2 Integer Linear Programming - Results
To compare the performances of the different layouts, we consider the number of memory
stall cycles and sometimes MIPS consumed, which is commonly used in embedded systems
design. MIPS consumed refers to the processing capability required to guarantee real-
time performance for a given application. Thus, the higher the MIPS consumed, the lower
the performance of the layout. Conversely, the lower the MIPS consumed, the more
applications, or the higher the number of instances of the same application, that can be
run on the embedded device,
guaranteeing real-time performance for all of these instances. MIPS consumed for a given
data layout is obtained by running the application on the simulator with the given layout
and for the given memory architecture. We report these numbers when a single instance
of the application is run on the embedded system.
First, we report the MIPS consumed by the optimal solution obtained using the fol-
lowing formulations for the four applications.
• The basic formulation is used to get a data layout considering only the internal and
external memory. The internal memory is considered as one single SARAM bank
of size 12K words. We use this as the baseline model. In Table 3.3, we report the
normalized MIPS consumed, the number of variables and the number of constraints
for each ILP formulation, and the time taken on a 900 MHz Pentium III machine1
to solve the ILP problem.
• The on-chip memory is split into multiple SARAM banks (three 4K word SARAM
banks) and the formulation that handles multiple memory banks was used. The
results for this case (refer to Table 3.3) show 14%, 16%, 3.8%, and 7.1% performance
improvement over the baseline case for the four applications.
• Next, the basic formulation is extended to handle SARAM and DARAM banks. For
this formulation we assumed that the internal memory consists of 2 banks, one, of 8K
word size, supporting single access (SARAM) and another, of 4K word, as DARAM.
This optimization gives a 16%, 18%, 4.8% and 7.4% performance improvement over
the baseline case for the four applications.
• The last experiment is performed by enhancing the basic formulation to handle both
multiple banks, and bank types (SARAM and DARAM). The memory configuration
considered here consists of two 4K word SARAM banks and one 4K word DARAM
1 These experiments were run in 2002 on a set of proprietary benchmarks with a desktop configuration which was state of the art at that time. We are unable to repeat the experiments on a more modern platform due to portability reasons. We have, however, run another set of benchmarks on a recent desktop configuration and the results are reported in Section 3.6.3.
bank. This optimization exploits both multiple memory banks and dual access
capabilities of the scratch pad memory and gives a significant reduction (28%, 30%,
9% and 13.8%) in MIPS consumption over the baseline case.
We remark that the somewhat lower performance improvement in Appln. 3 and Appln. 4
could be due to the fact that these are kernel codes where multiple simultaneous memory
Figure 3.4 shows the performance of the genetic algorithm when the number of gen-
erations is increased from 700 to 4000. The bar charts correspond to (a) the run-time of
the GA in logarithmic scale, and (b) the difference of the number of conflicts resolved by
the heuristic and the GA. Notice that when the number of generations is less than 3000,
the GA is in the catch-up mode. When the number of generations reaches 4000, the GA
outperforms the heuristic algorithm.
3.6.4 Comparison of Heuristic Data Layout with GA
We compared our heuristic data layout performance with GA’s data layout. Figure 3.5
presents the normalized performance of heuristic data layout. We randomly picked 100
different architectures for obtaining data layout for the Voice Encoder application. For
these 100 points, we ran both GA and our heuristic algorithm. The x-axis represents test
Figure 3.4: Relative performance of the Genetic Algorithm w.r.t. Heuristic, for Varying Number of Generations
[Figure: normalized performance (y-axis, 0 to 1.2) versus test number (x-axis, 0 to 100).]
Figure 3.5: Comparison of Heuristic Data Layout Performance with GA Data layout
case identifier from 1 to 100. The y-axis presents the memory stall cycles of heuristic
normalized by GA’s memory cycles (M gacyc/M
heucyc ). It can be observed that the heuristic
method performs as well as the GA for most of the points. The worst performance of the
heuristic is approximately 25% below the GA's performance, for two of the test cases. On
average, however, the heuristic performs at 98% efficiency in terms of solution quality
compared to the GA. The execution time of the GA for completing all 100 data layouts is
approximately 22 hours, whereas the execution time of the heuristic for completing all 100
placements is less than a second on a Pentium 4 desktop machine with 1 GB of main memory
operating at 1.7 GHz. Thus the heuristic method is an attractive option, providing
efficient solutions in very little execution time for the large number of data layout
problems that arise in memory architecture exploration. Note that for some of the points
the heuristic performs better than the GA; this may be due to the GA terminating after a
fixed number of generations. Based on the above results, we can conclude that the
heuristic algorithm is both fast and efficient.
3.6.5 Comparison of Different Approaches
In Table 3.5, we provide a qualitative comparison of the heuristic algorithm, the genetic
algorithm, and the ILP-based approach for the data layout problem. We see that the run-time
of the heuristic is the lowest among the three approaches. The run-time of the Genetic
algorithm depends on the number of generations, the population size P , and the number
of offspring per generation M . For the four large test cases of Table 3.4, the run-time
of the heuristic algorithm was of the order of 1 second, whereas the GA took about 20
minutes to complete. The ILP approach, on the other hand, required several hours to
converge to the optimal solution. As a matter of fact, a public domain ILP solver could
not converge in 24 hours for the 6-instance Voice Encoder and the 32-instance Levinson’s
LPC applications. This clearly demonstrates that the GA and the heuristic methods
are attractive from the point of view of quickly solving the data layout problem. From
the view point of optimality of solutions, ILP is guaranteed to converge to the optimal
and is hence ranked the best. The GA comes second since it provides better solutions for
larger problem instances. From the viewpoint of flexibility, we believe GA is most flexible,
since the cost function can be easily reflected in the fitness measure. For example, if the
power or energy dissipation must be minimized, we can modify the fitness function to
be a weighted average of the power and performance metrics and still reuse the genetic
algorithm framework. In contrast, it is difficult for the heuristic algorithm to
simultaneously optimize both performance and power.
From Table 3.5, we see that each of the three approaches for the data placement
problem has a definite advantage over the other two methods in terms of run-time, quality,
or flexibility. This point can be exploited as follows. The algorithms presented in the
Table 3.5: Comparative Ranking of Algorithms
Optimization Approach   Run-Time       Quality of Solution   Flexibility
Heuristic               Best           Worst                 Intermediate
GA                      Intermediate   Intermediate          Best
ILP                     Worst          Best                  Worst
previous section are intended to optimize the placement of data sections for a given
memory architecture. Often, the designers of the SoC have the flexibility to change the
memory architecture. It would be ideal if the memory architecture optimization and
the data section placement were to be done concurrently. This is a classic example of
hardware-software codesign. We propose a “multi-objective Genetic Algorithm” technique
for memory architecture exploration, where the number and size of the SARAM and
DARAM banks can be determined through a combinatorial search process. For each of
the memory architecture considered, a quick “fitness” can be computed based on the cost
of the best placement obtained using the heuristic algorithm. The heuristic is an ideal
choice for computation of the bound since it is the fastest approach, requiring less than 1
second of run time. After a small number of competing memory architectures have been
shortlisted through this procedure, the GA can be used to explore the best placement of
data sections for each of the memory configurations.
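This shortlist-then-refine flow can be sketched in a few lines; heuristic_cost and ga_cost are placeholder callables standing in for the two layout engines described above.

```python
def explore(architectures, heuristic_cost, ga_cost, shortlist=5):
    """Rank all candidate memory architectures by the fast heuristic's
    layout cost, then rerun only a small shortlist with the slower but
    better GA layout and return the overall winner (costs minimized)."""
    ranked = sorted(architectures, key=heuristic_cost)[:shortlist]
    return min(ranked, key=ga_cost)
```

The point of the two stages is that the expensive GA layout is invoked only `shortlist` times instead of once per candidate architecture.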
3.7 Related Work
Several efficient heuristic approaches for data layout have been published in the literature
[10, 34, 40, 44, 53, 55, 58, 67, 71, 79]. These can be classified as static and dynamic
methods. In static data layout, the memory addresses for all data variables are decided at
compile time and do not change at run-time. In dynamic data layout [79], on-chip SPRAM
is reused by overlaying many data variables to the same address. Thus, two addresses are
assigned to a variable at compile time, namely, a load address and a run address. A variable
is loaded at the load address and copied to the run address at run-time. At the cost of
increased complexity, overlaying attempts to improve the system efficiency by delivering
better run-time performance with lower SPRAM size.
Avissar et al. [10] present an Integer Linear Programming (ILP) based technique for a
compiler to automatically allocate data to on-chip or external memory based on the
access frequency. In [10], the authors handle the data partitioning problem for
globals and stack variables. They propose an approach to partition the stack into two
parts. They are the first to present a solution for partitioning stack variables into critical
and non-critical sections such that the critical stack variables can be mapped to on-
chip memory and the non-critical stack variables are mapped to off-chip memory. This
approach requires additional run-time, as two stacks have to be managed. They also
propose an alternative approach, which incurs no additional run-time, where all local
variables of a function are allocated in one of the stacks (on-chip or off-chip) with
some performance loss. Their memory architecture considers only on-chip and external
memory, without considering multiple memory banks. Multiple memory banks are part
of every DSP-based embedded application, and it is critical to take this into account
during data layout. Our observation is also that ILP-based approaches typically take a
long time (a few hours) to converge for moderate to complex applications.
Leupers et al., [44] present an interference graph based approach for partitioning vari-
ables that are simultaneously accessed in different on-chip memory banks. For a given
interference graph, the Integer Linear Programming (ILP) approach is used to solve the
problem of maximizing the weights (number of parallel accesses). This work does not
consider DARAM, which is very important for DSP applications, and ILP-based approaches
may not be practical for complex applications. Further, they consider improving only
run-time; in this thesis we look at both power and run-time, and in Chapter 5 we show
that a significant trade-off is possible between power and performance. To avoid the cycle
costs from self-conflicts (multiple simultaneous accesses to the same array), [58] suggests
partial duplication of data in different memory banks.
In [40], Ko et al., present a simple heuristic to partition data with the objective
of resolving parallel conflicts and also balance the size of the partitions. Balancing is
important as typically programmable processors have equal sized memory banks. This
work uses benchmarks written in C, and hence the conflict graphs are very sparse; only
bipartite graphs are obtained from the compiler. Because of this, they could resolve all
the conflicts, and their main focus is only on balancing the data partitions. Typical
DSP applications, however, have dense conflict graphs, as the critical part of the
software is developed in hand-optimized assembly. Their work does not address parallel
conflicts between more
than two arrays. Also they do not consider dual access RAMs. Their objective is to
reduce the data conflicts and improve run-time.
In [71], Sundaram et al., present an efficient data partitioning approach for data ar-
rays on limited-memory embedded systems. They perform compile time partitioning of
data segments based on the data access frequency. The partitioned data footprints are
placed in local or remote memory with the help of a 0/1 knapsack algorithm. Here the
data partitioning is performed at a finer granularity and because of this, the address com-
putation needs to be modified for functional correctness. In contrast, in our work, the
data partitioning is performed at the data section level and the data layout optimization
is performed by considering a more complex on-chip memory architecture with multi-
ple single and dual port memory banks. Further, no additional address computation or
modification to address computation is required in our approach.
Kulkarni et al., [41] present formal and heuristic algorithms to organize the data
in the main memory with the objective of reducing cache conflict misses. In [55], a
data partitioning technique is presented that places data into on-chip SRAM and data
cache with the objective of maximizing performance. Based on the life times and access
frequencies of array variables, the most conflicting arrays are identified and placed in
scratch pad RAM to reduce the conflict misses in the data cache. This work addresses
the problem of limiting the number of memory stalls by reducing the conflict misses
in the data cache through efficient data partitioning. Our work addresses the problem
of reducing the memory stalls by efficient data partitioning within the on-chip scratch
pad RAM itself. Also our work addresses the data layout for DSP applications, where
resolving self and parallel conflicts by efficient partitioning of data variables is very critical
for achieving real-time performance. Lastly, the memory architecture considered in the
initial part of our thesis does not have data cache; in Chapter 7, we consider memory
architecture with on-chip RAM and caches.
3.8 Conclusions
In this chapter, we described three approaches to solve the data placement problem in
embedded systems. Given a memory architecture, the placement of data sections is crucial
to the performance of the system. Badly placed data can result in a large number of
memory stalls. We consider a memory architecture that consists of on-chip single-access
RAM with multiple memory banks, on-chip dual-access RAM, and external RAM. We
analyze the application for data conflicts using a profiling tool and create a matrix rep-
resentation of the conflict information. We present three different methods to address
the data layout problem: (a) an ILP formulation, (b) a Genetic Algorithm, and (c) a
greedy back-tracking heuristic algorithm. The greedy back-tracking heuristic and Genetic
Algorithm approaches outperform the ILP-based formulation in terms of the time to solve
the data layout problem. However, the ILP and GA methods produce better-quality results,
especially for large-sized applications. The framework of the GA is generic enough to
permit other cost functions such as power dissipation [64] to be incorporated. In Chap-
ter 5 we extend the GA formulation to consider performance and power minimization.
Similarly, the GA can also be extended to concurrently explore alternative memory ar-
chitectures [54] – this is possible by changing the representation of the chromosome and
reworking the crossover and mutation operations.
Chapter 4
Logical Memory Exploration
4.1 Introduction
In the previous chapter, we discussed data layout methods to find optimal and near-
optimal placement of data for a given fixed memory architecture for embedded DSP
processors. In this chapter we focus on memory architecture exploration for a given
application, with the goals of improving memory performance (reducing memory stalls) and
reducing memory area. In Chapter 5 we extend our approach to consider power consumption
as an additional objective.
As discussed in Chapter 1, embedded systems are application specific and hence em-
bedded designers study the target application to understand the memory architecture re-
quirements. DSP applications are typically data intensive and require very high memory
bandwidth to meet real-time requirements. There are two steps to designing an optimal
memory architecture for a given application. The first step is to find the right memory ar-
chitecture parameters that are important for improving the target application's performance,
and the second step is to optimally map the given application onto the memory architec-
ture under consideration. This leads to a two-level optimization problem with multiple
objectives. At the first level, an appropriate memory architecture must be chosen which
includes determining the number and size of each memory bank, the number of memory
ports per bank, the types of memory (scratch pad RAM or cache), wait-states/latency
etc. Thus the number of possible memory architectures for an SoC running a given
application is very large. The objective functions at this level are the memory system
cost, performance,
and power dissipation. However, the performance for a given application for a given ar-
chitecture depends on the appropriate placement of code and data sections in the various
on-chip memory banks or the off-chip memory modules. Hence, at the next level, for a
given application, the code and data sections must be placed optimally in memory to
minimize the number of stall cycles. As discussed in the previous chapter, the number
of placements for a given architecture is also large. Thus the combined space of memory
architectures and data placements is formidably large, and finding an optimal solution
to the memory space exploration problem involves exploring this large design space. A
performance-optimal solution may not be optimal in terms of cost or power
consumption. In this solution space there are several interesting design points known as
Pareto-optimal points for the embedded system design. This is especially the case as an
embedded system designer typically designs multiple variants of an embedded product (to
meet different market segments) and hence wants to obtain several good solutions, each
of which may make sense for a different application segment. Hence the memory space ex-
ploration problem should identify multiple Pareto-optimal design points. Further since
embedded system products are often designed under tight time-to-market constraints, the
resources available for such an optimization process are limited. To make the problem
more complex, the market space is volatile and frequently the top-level specification and
architecture may be redefined during the life cycle of a product.
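Pareto optimality can be made concrete with a short sketch. Each design point is a tuple of objectives to be minimized (for example memory stalls, area, power), and a point is Pareto-optimal when no other point dominates it.

```python
def dominates(a, b):
    # a dominates b: no worse in every objective, strictly better in one
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    # keep only the non-dominated (Pareto-optimal) design points
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```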
In this chapter, we propose an efficient methodology for the memory architecture
exploration of the DSP core1. We concentrate mainly on the DSP core as it largely
determines the performance of the embedded application. We consider both on-chip and
off-chip memory space exploration for the DSP core. The memory architecture exploration
problem involves identifying the appropriate memory architectures for a given application,
in terms of performance, power consumption and cost. As mentioned earlier, this involves
1In addition to the DSP core, an embedded SoC will have a micro-controller which has embeddedmemory. In this thesis we do not focus on the memory system design of embedded micro controllersthough many of the methods proposed may be applicable to microcontrollers as well.
solving two interacting problems: (a) memory architecture exploration and (b) data layout
optimization for the architecture considered.
Previous work on data layout [10, 41, 44, 53, 58, 71] has focused on addressing the
layout problem independently, for a given memory architecture, with the objective of
improving either application run-time or energy consumption. Also, the previous
work in this area has addressed the data layout either for memory architecture on the
embedded side (microcontroller), where they do not consider dual-port memories, or on
the DSP side, where the on-chip/off-chip partitioning is not considered. A detailed com-
parison with related work is presented in Section 4.6. To the best of our knowledge,
there is no work which considers integrating memory architecture exploration and data
layout to explore the memory design space while targeting multiple objectives. This
integrated approach is critical for navigating the search space in the right direction
and obtaining multiple Pareto-optimal design points.
In this chapter we propose an iterative, two-level integrated approach to the data layout
and memory exploration problem. At the outer level, for architecture exploration, we use
a multi-objective evolutionary algorithm; we propose both a Genetic Algorithm (GA)
formulation and a Simulated Annealing (SA) formulation for this problem. For the inner
level, i.e., the data layout problem, we use the simple and fast heuristic algorithm
described in Section 3.5, because the data layout problem must be solved for several
thousands of memory architectures. As discussed in Chapter 3, the heuristic algorithm
proposed there performs reasonably well in reducing memory stalls and at the same time
obtains the data layout in very little computation time (less than 1 msec). In comparison,
the GA or ILP approach takes a few minutes to a few hours for each data layout problem,
which becomes prohibitively expensive when solving for a large number of memory
architectures.
The main contributions of this chapter are: (a) an iterative, two-level solution that
addresses data layout and architecture exploration as an integrated problem; (b) the use
of performance (in terms of memory stalls) and memory area as the two objectives of the
memory exploration framework; and (c) a memory exploration
framework that is fully automatic. The proposed memory exploration framework is flex-
ible and can be configured to explore additional memory architecture parameters. Also
the framework is scalable and additional objectives like power consumption can be added
easily. We have used 4 different multimedia and communication applications for our ex-
periments. Our proposed memory exploration method yields between 130 and 200
Pareto-optimal design choices (memory architectures) for each of the applications.
The rest of the chapter is organized as follows. Section 4.2 provides necessary back-
ground on the data layout and memory architecture exploration. Section 4.3 describes the
multi-objective Genetic Algorithm (GA) formulation of the memory exploration problem.
Section 4.4 explains the formulation of the memory architecture exploration problem in
Simulated Annealing.

Notation:
Sp  Single-port memory bank
Dp  Dual-port memory bank
Ns  Number of SARAM banks
Bs  SARAM bank size
Nd  Number of DARAM banks
Bd  DARAM bank size
Es  External memory size
Ds  Total data size
Ws  Normalized weight for SARAM
Wd  Normalized weight for DARAM
We  Normalized weight for external memory
with the physical memory architecture, and hence we defer the power objective to the
following chapter on physical memory exploration.
Memory cycles (Mcyc) is the sum of all memory stall cycles where the CPU is waiting
for memory. This includes stall cycles spent in on-chip memory bank conflicts and off-chip
memory latency. Our objective is to minimize the number of stall cycles (Mcyc). It is very
critical to have an efficient data layout algorithm to obtain a valid Mcyc. Note that if an
efficient data layout algorithm is not used, the data mapping may not be optimal, leading
to a higher Mcyc even for a good memory architecture. This may steer the memory
architecture exploration search in a completely wrong direction.
Memory cost is directly proportional to the silicon area occupied by memory. Since
the memory silicon area is dependent on the silicon technology, memory implementation,
and the ASIC cell library that is used, instead of considering the absolute silicon area
numbers, for now, we consider the relative (logical) area. The memory cost is defined by
operators, the creation of the initial population and the termination criteria. Figure 6.5
explains the GA formulation of the Physical Memory Mapping problem.
6.2.3 Genetic Algorithm Formulation
6.2.3.1 Chromosome Representation
Each individual chromosome represents a physical memory architecture. As shown in
Figure 6.5, a chromosome consists of a list of physical memories picked from an ASIC
memory library. This list of physical memories is used to construct a given logical
memory architecture. Typically, multiple physical memory modules are used to construct
a logical memory bank. As an example, if the logical bank is of size 8K*16bits, then the
physical memory modules can be two 4K*16bits or eight 2K*8bits or eight 1K*16bits and
so on. We have limited the number of physical memory modules per logical memory bank
6.2 Logical Memory Exploration to Physical Memory Exploration (LME2PME) 117
Figure 6.4: Logical to Physical Memory Exploration - Method
to at most k. Thus, a chromosome is a vector of d elements, where d = Nl ∗ k + 1 and
Nl is the number of logical memory banks, which is an input from LME. Note that each
element represents an index into the semiconductor vendor memory library, corresponding
to a specific physical memory module.
For decoding a chromosome, for each of the Nl logical banks, the chromosome has k
elements. As mentioned earlier, each of the k elements is an integer used to index into
the semiconductor vendor memory library. With the k physical memory modules, a logical
memory bank is formed. We have used a memory allocator that performs exhaustive
combinations with the k physical memory modules to get the largest logical memory re-
quired with the specified word size. Here, the bank size, the word size and the number
of ports are obtained from the logical memory architecture, corresponding to the chosen
non-dominated point. In this process, it may happen that m out of the total k physical
Figure 6.5: GA Formulation of LME2PME
memories selected may not be used, if the given logical memory bank can be constructed
with k−m physical memories.1 For example, if k=4, the 4 elements are 2K*8bits, 2K*8bits,
1K*8bits, and 16K*8bits, and the logical memory bank is 2K*16bits, then our memory
allocator builds a 2K*16bits logical memory bank from the two 2K*8bits modules, and the
remaining two memories are ignored. Note that the 16K*8bit and 1K*8bit memories are
removed from the configuration, as the logical memory bank can be constructed optimally
with the two 2K*8bit memory modules. Here, the memory area of this logical
1 This approach of using only the required k−m physical memory modules relaxes the constraint that the chromosome representation has to exactly match a given logical memory architecture. This, in turn, facilitates the GA approach exploring many physical memory architectures efficiently.
memory bank is the sum of the memory areas of the two 2K*8bit physical memory modules.2
This process is repeated for each of the Nl logical memory banks. The memory area of
a memory architecture is the sum of the area of all the logical memory banks.
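A minimal sketch of this decoding is given below. Modules are (depth, width) pairs, and the allocator is simplified to stitching equal-depth modules side by side to reach the target width (a real allocator would also stack modules depth-wise); with that assumption it reproduces the two-2K*8 example above.

```python
from itertools import combinations

def build_bank(modules, depth, width):
    """Smallest subset of candidate modules that tiles a depth x width
    logical bank by width-stitching equal-depth modules; None if the
    bank cannot be built.  Unused modules are simply dropped."""
    for r in range(1, len(modules) + 1):
        for combo in combinations(modules, r):
            if (all(d == depth for d, w in combo)
                    and sum(w for d, w in combo) == width):
                return list(combo)
    return None

def decode(chromosome, library, banks, k):
    """chromosome: flat list of library indices, k genes per logical
    bank; banks: list of (depth, width) targets from the LME step.
    Returns the physical modules actually used per logical bank."""
    out = []
    for n, (depth, width) in enumerate(banks):
        picks = [library[g] for g in chromosome[n * k:(n + 1) * k]]
        out.append(build_bank(picks, depth, width))
    return out
```

The area of a decoded bank is then simply the sum of the areas of the modules returned, matching the fitness computation described below.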
6.2.3.2 Chromosome Selection and Generation
The strongest individuals in a population are used to produce new off-springs. The
selection of an individual depends on its fitness; an individual with a higher fitness has
a higher probability of contributing one or more offsprings to the next generation. In
every generation, from the P individuals of the current generation, M new offsprings
are generated using mutation and crossover operators, resulting in a total population
of (P + M). The crossover operation is performed as illustrated in Figure 6.5. From
this total population of (P + M), P fittest individuals survive to the next generation.
The remaining M individuals are annihilated. The crossover and mutation operators are
implemented in the standard way.
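The (P + M) generational scheme described above can be sketched as follows. The chromosome encoding, crossover and mutation operators are placeholders, and a scalar fitness stands in for the Pareto ranking of Section 6.2.3.3, so this is a sketch of the survivor-selection mechanics only.

```python
import random

def next_generation(population, fitness, crossover, mutate, M, rng=random):
    """One GA generation: create M offspring from the current P
    individuals via crossover and mutation, then keep the P fittest
    of the combined (P + M) pool.  `fitness` returns a value where
    higher is fitter."""
    P = len(population)
    offspring = []
    for _ in range(M):
        a, b = rng.sample(population, 2)   # parent selection (uniform here)
        offspring.append(mutate(crossover(a, b)))
    pool = population + offspring          # total population of (P + M)
    pool.sort(key=fitness, reverse=True)   # fittest first
    return pool[:P]                        # survivors; the rest are annihilated

# Toy usage: chromosomes are bit tuples, fitness = number of ones.
pop = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
survivors = next_generation(pop, fitness=sum,
                            crossover=lambda a, b: a[:1] + b[1:],
                            mutate=lambda c: c, M=2,
                            rng=random.Random(0))
```

In the thesis the ranking step is the non-dominated sorting of Section 4.3 rather than a single scalar.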
6.2.3.3 Fitness Function and Ranking
For each of the individuals, the fitness function computes Marea and Mpow. Note that
Mcyc is not computed as it is already available from LME. Marea is obtained from
the memory mapping block and is the sum of the areas of all the physical memory modules
used in the chromosome. Mpow is computed based on two factors: (a) the access frequency
of data-sections and the data-placement information and (b) power per read/write access
information derived from the semiconductor vendor memory library for all the physical
memory modules.
To compute the memory power the method uses the data layout information provided
by the LME step. Based on the data layout, and the physical memories required to form
the logical memory (obtained from the chromosome representation), the accesses to each
data section are mapped to the respective physical memories. From the power per
access of each physical memory and the number of accesses to each data section, the
total memory power consumed by all accesses to a data section is determined. From this,
the total memory power consumed by the entire application on a given physical memory
architecture is computed by summing the power consumed by all the data sections.
²Although the chromosome representation may have more physical memories than required to construct the given logical memory, the fitness function (area and power estimates) is derived only for the required physical memories.
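The power computation described above amounts to a weighted sum over data sections; a minimal sketch, with all names and values being our own illustrative assumptions:

```python
def total_memory_power(placement, accesses, power_per_access):
    """Total memory power consumed by the application.
    placement[d]        -> physical memory holding data section d
    accesses[d]         -> number of accesses made to data section d
    power_per_access[m] -> energy per read/write access of physical
                           memory m (from the vendor memory library)"""
    return sum(accesses[d] * power_per_access[placement[d]]
               for d in placement)

# Two data sections mapped to two physical memories.
placement = {"coeffs": "spram0", "frame": "dram"}
accesses = {"coeffs": 1000, "frame": 200}
ppa = {"spram0": 0.5, "dram": 5.0}
print(total_memory_power(placement, accesses, ppa))  # 1500.0
```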
Once the memory area, memory power and memory cycles are computed for all the
individuals in the population, the individuals are ranked according to the Pareto optimality
conditions given in the following equation, which is similar to the Pareto optimality
condition discussed in Chapter 4 but considers all three objective functions. Let
(M^a_pow, M^a_cyc, M^a_area) and (M^b_pow, M^b_cyc, M^b_area) be the memory power,
memory cycles and memory area of chromosome A and chromosome B. A dominates B if the
following expression is true:

((M^a_pow < M^b_pow) ∧ (M^a_cyc ≤ M^b_cyc) ∧ (M^a_area ≤ M^b_area))
∨ ((M^a_cyc < M^b_cyc) ∧ (M^a_pow ≤ M^b_pow) ∧ (M^a_area ≤ M^b_area))
∨ ((M^a_area < M^b_area) ∧ (M^a_cyc ≤ M^b_cyc) ∧ (M^a_pow ≤ M^b_pow))     (6.1)
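The dominance condition amounts to "no worse in every objective and strictly better in at least one", so it and the removal of dominated points can be expressed compactly. A sketch, with tuples ordered as (Mpow, Mcyc, Marea) and all objectives minimized:

```python
def dominates(a, b):
    """a, b are (Mpow, Mcyc, Marea) tuples; all objectives minimized.
    a dominates b iff a is no worse in every objective and strictly
    better in at least one -- equivalent to the three-clause
    disjunction above."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep only the non-dominated points."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

archs = [(10, 100, 4), (12, 90, 4), (11, 110, 5)]
print(pareto_front(archs))   # [(10, 100, 4), (12, 90, 4)]
```

The third point is dominated by the first (worse in all three objectives) and is discarded; the first two trade power against cycles and both survive.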
For ranking of the chromosomes, we use the non-dominated sorting process described
in Section 4.3. The GA must be provided with an initial population that is created
randomly. In our implementation we have used a fixed number of generations as the
termination criterion.
6.3 Direct Physical Memory Exploration (DirPME)
Framework
6.3.1 Method Overview
In the LME2PME approach described in the previous section, the physical memory ex-
ploration is done in two steps. In this section we describe the DirPME framework that
directly operates in the physical memory design space. The memory exploration framework
consists of two levels. The outer level explores various memory architectures, while the
inner level explores the placement of data sections (the data layout problem) to minimize
memory stalls. More specifically, the outer level, the memory architecture exploration
phase, targets the optimization of the cache and SPRAM sizes and the organization of the
cache architecture, including cache-line size and associativity. We use
an exhaustive search¹ for memory architecture exploration by imposing certain practical
constraints (such as, the memory bank size is always a power of 2) on the architectural
parameters. Although these constraints limit the search space, they still allow all “prac-
tical” architectures to be considered and at the same time help to reduce the run-time
of the memory exploration phase drastically. The exploration module takes the application's
total data size as input and provides an instance of the memory architecture by defining
(a) cache size, (b) cache block size, (c) cache associativity and (d) SPRAM size². Based on
the SPRAM size and the application access characteristics, the data partitioning heuris-
tic identifies the data sections to be placed in SPRAM. The remaining data sections are
placed in off-chip RAM. The details of the data partitioning heuristic are presented in
Section 7.3.
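Under the power-of-2 restriction, the outer-level enumeration reduces to a few nested loops. The parameter ranges below are illustrative assumptions, not the bounds used in the thesis:

```python
def enumerate_architectures(total_data_size):
    """Yield candidate (cache_size, block_size, associativity, spram_size)
    tuples, with every size parameter restricted to a power of two.
    The ranges are illustrative assumptions."""
    archs = []
    cache_sizes = [1 << i for i in range(10, 16)]   # 1KB .. 32KB
    block_sizes = [1 << i for i in range(4, 8)]     # 16B .. 128B
    assocs      = [1, 2, 4]
    spram_sizes = [1 << i for i in range(10, 16)]   # 1KB .. 32KB
    for c in cache_sizes:
        for b in block_sizes:
            for a in assocs:
                for s in spram_sizes:
                    if s <= total_data_size:        # SPRAM larger than the data is pointless
                        archs.append((c, b, a, s))
    return archs

print(len(enumerate_architectures(96 * 1024)))   # 432
```

Even with generous ranges the constrained space stays in the hundreds of points, which is what makes exhaustive search practical here.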
The cache conscious data layout heuristic assigns addresses to the data sections placed
in off-chip RAM such that these data do not conflict in the cache. The data layout heuristic
uses the temporal access information as input to find the optimal data placement. The
objective is to minimize the number of cache misses. In Section 7.4 we discuss the
proposed cache-conscious data layout.
The data partitioning heuristic and data layout heuristic together place the application
data in SPRAM and off-chip RAM respectively. From the temporal access information
¹Alternative approaches such as a genetic algorithm or simulated annealing could also be used here. However, we found that the exhaustive approach explores all practical memory architectures in a reasonable amount of computation time.
²The proposed framework can easily be extended to consider SPRAM organization parameters such as the number of banks, the number of ports, etc. We do not consider these here as they were extensively dealt with in the earlier chapters.
of data sections and access frequency information, the run-time performance in terms of
memory stall cycles is computed. The memory stalls include stall cycles due to concurrent
accesses to the same single-ported SPRAM bank, and stall cycles due to cache misses and the
miss penalty (the off-chip memory access to fetch the cache block). The software eCacti [45] is used
to obtain the power per cache read-hit, read-miss, write-hit and write-miss. The SPRAM
power per read access and power per write access are obtained from the semiconductor
vendor's ASIC memory library. The area for a given cache architecture is computed using
eCacti [45], and the area for SPRAM is obtained from the memory library.
Figure 7.2: Memory Exploration Framework
The exploration process is repeated for all valid memory architectures and the area,
power and performance are computed for each of these. The last step is to identify the list
of “optimal” architectures. Since this is a multi-objective problem, all the solution points
are evaluated according to the Pareto optimality conditions given by Equation 6.1 in
Section 6.2.3.3. According to this equation, if (M^a_pow, M^a_cyc, M^a_area) and
(M^b_pow, M^b_cyc, M^b_area) are the memory power, memory cycles and memory area for
memory architectures A and B respectively, then A dominates B if the following
expression is true:

((M^a_pow < M^b_pow) ∧ (M^a_cyc ≤ M^b_cyc) ∧ (M^a_area ≤ M^b_area))
∨ ((M^a_cyc < M^b_cyc) ∧ (M^a_pow ≤ M^b_pow) ∧ (M^a_area ≤ M^b_area))
∨ ((M^a_area < M^b_area) ∧ (M^a_cyc ≤ M^b_cyc) ∧ (M^a_pow ≤ M^b_pow))
From the set of solutions generated by the memory architecture exploration module, all the
dominated solutions are identified and removed. The non-dominated solutions form the
Pareto optimal set, which represents the set of good architectural solutions that provide
interesting design trade-off points from a power, performance and cost viewpoint.
7.3 Data Partitioning Heuristic
As the cache structure has associated tag overheads, SPRAM consumes much less area than
a cache on a per-bit basis [12]. Further, an SPRAM memory access consumes less power
than a memory access that is a cache hit [12]. While the data sections mapped to off-chip
memory share the cache space dynamically and in a transparent manner, SPRAM space
is assigned to data sections exclusively if dynamic data layout is not used. As a result, the
usage of SPRAM is costly from a system perspective, as it remains locked to specific data
after the data layout, unlike in caches, where the space is effectively reused through
dynamic mapping of data by hardware. Hence, SPRAM has to be carefully utilized and
the objective in a memory architecture exploration should be to minimize the SPRAM
size.
The objective of data partitioning is to identify data sections that must be placed in
SPRAM for best performance. We refer to a set of data (one or more scalar variables
or array variables) that are grouped together as one data-section. A data-section forms
an atomic unit that will be assigned a memory address. All data that are part of a data
section are placed in memory contiguously. An example of a data section is an array data
structure.
In order to identify data sections that should be mapped to SPRAM, our heuristic
uses different characteristics of the data section. These include the access frequency, the
temporal access pattern and the spatial locality pattern. These are explained below.
To model the temporal access pattern of different data sections, a temporal relationship
graph (TRG) representation has been proposed in [17]. A TRG is an undirected graph,
where nodes represent data sections and an edge between a pair of nodes indicates that
two successive references to either of the data sections are interleaved by a reference to
the other. The weight associated with an edge (a, b) represents the number of times such
interleaved accesses of a and b have occurred in the access pattern. We illustrate these
ideas with the help of an example.
Let there be 4 data-sections a, b, c and d and the access pattern of these data sections
in the application be:
a a a b c b c b c b c d d d d d a a a a a a a c a c a a c a c
Figure 7.3: Example: Temporal Relationship Graph
For this access pattern the TRG is shown in Figure 7.3. Given a trace of data memory
references, the weight associated with (a, b), denoted by TRG(a, b), is the number of times
that two successive occurrences of a are intervened by at least one reference to b, or vice
versa. As an example, for the pattern bcbcbcb, TRG(b, c) = 5. Note that a reference to
c intervenes successive references to b on three occasions and a reference to b intervenes
successive references to c twice, making TRG(b, c) = 5. For the given pattern,
TRG(b, d) = 0 as there are no interleaved accesses; hence no edge exists between b and
d. The TRG is computed for all the data sections from the address trace collected from an
instruction set simulator. We define STRG(i) as the sum of all TRG weights on the edges
connected to node i. As an example, from Figure 7.3, STRG(a) = 10.
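The TRG construction from an access trace can be sketched as follows. This is our rendering of the definition above, not the thesis implementation; it reproduces both the bcbcbcb example and STRG(a) = 10 for the trace of Figure 7.3.

```python
from collections import defaultdict

def build_trg(trace):
    """TRG(x, y): number of times two successive occurrences of x are
    intervened by at least one reference to y, or vice versa."""
    trg = defaultdict(int)
    last_seen = {}            # symbol -> index of its previous occurrence
    for i, x in enumerate(trace):
        if x in last_seen:
            # Distinct symbols appearing between the two occurrences of x.
            for y in set(trace[last_seen[x] + 1 : i]) - {x}:
                trg[frozenset((x, y))] += 1
        last_seen[x] = i
    return trg

def strg(trg, node):
    """Sum of TRG weights on all edges connected to `node`."""
    return sum(w for edge, w in trg.items() if node in edge)

print(build_trg(list("bcbcbcb"))[frozenset("bc")])         # 5

trace = list("aaabcbcbcbcdddddaaaaaaacacaacac")            # pattern from the text
print(strg(build_trg(trace), "a"))                         # 10
```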
Next, we define a term, the spatial locality factor, which gives a measure of the spatial
locality in the access trace for each data section. The spatial locality is influenced by the
stride in accessing different elements of the data section. The spatial locality factor is
computed by running the filtered access trace, which contains only the accesses pertaining
to that data section, through a cache with a single block and counting the number of
misses incurred. The spatial locality factor is the ratio of the number of such misses to the
size of the data section. For example, if the accesses to data section b in the filtered trace
bbbb correspond to cache blocks b1b2b1b1, where b1 and b2 correspond to different blocks
(determined by the cache block size), and the size of data section b is Sb cache blocks, then
the spatial locality factor is 3/Sb.
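The single-block-cache computation can be sketched as follows; the byte addresses in the example are our own, chosen so that the trace touches blocks in the pattern b1 b2 b1 b1 as in the text.

```python
def spatial_locality_factor(addresses, block_size, section_size):
    """SLF = misses on a one-block cache, driven by the filtered trace
    of one data section, divided by the section size in cache blocks."""
    misses, resident = 0, None
    for addr in addresses:
        block = addr // block_size
        if block != resident:          # one-block cache: any block change is a miss
            misses += 1
            resident = block
    blocks_in_section = -(-section_size // block_size)   # ceiling division
    return misses / blocks_in_section

# Trace bbbb touching blocks b1 b2 b1 b1 (32-byte blocks, 64-byte section).
print(spatial_locality_factor([0, 40, 4, 8], 32, 64))    # 3 misses / 2 blocks = 1.5
```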
There are three parameters that control the decision to keep a data section in an
on-chip SPRAM.
1. Access Frequency (Af) : Placing the most frequently accessed data section in
SPRAM gives better power consumption and better run-time performance.
2. Temporal Access Characteristics : A data section is said to be conflicting if it gets
accessed along with many other data sections. Placing the most conflicting data
section in SPRAM reduces the number of cache conflict misses and hence improves
the overall memory subsystem performance. This parameter is computed from the
TRG. The STRG factor is a direct indication of the extent to which a data section’s
life-time overlaps with other data sections.
3. Spatial Locality Factor (SLF): Data sections that have poor spatial locality (i.e., a
higher spatial locality factor) use more cache lines simultaneously and thereby reduce
the available cache space for other data. Also, such data exhibit less spatial reuse,
causing more cache misses, which in turn increases the power consumption due to
off-chip memory accesses. Hence, it is both power and performance efficient to place
a data section that has a higher spatial locality factor in SPRAM.

Table 7.1: Input Parameters for Data Partitioning Algorithm

Notation    Description
N           Number of data sections
TRG(a, b)   Temporal access pattern between nodes a and b
STRG(a)     Sum of TRG weights on all edges connected to node a
AF(a)       Access frequency of data section a
SLF(a)      Spatial locality factor of data section a
Thus, a frequently accessed data section that conflicts most with the rest of the data and
also exhibits the least spatial locality is an ideal candidate to be placed in SPRAM, as this
gives the best performance from an overall memory subsystem perspective. For each of the data
sections, a conflict index is computed using the three parameters mentioned above. The
conflict index of a node corresponding to data section s is computed as follows.
nSTRG(s) = STRG(s) / (Σ_{i=1}^{N} STRG(i))     (7.1)

nAF(s) = AF(s) / (Σ_{i=1}^{N} AF(i))     (7.2)

nSLF(s) = SLF(s) / (Σ_{i=1}^{N} SLF(i))     (7.3)

CI(s) = nSTRG(s) + nAF(s) + nSLF(s)     (7.4)

In the above equations, SLF(s) and AF(s) correspond to the spatial locality factor
and access frequency of s respectively. The terms on the LHS of Equations 7.1, 7.2 and
7.3 are normalized factors. The higher the conflict index, the more suitable the data section is
for SPRAM placement. Our data partitioning heuristic algorithm is explained in Figure
7.4. The greedy heuristic sorts the data sections based on the conflict index and assigns
the data section that has the highest conflict index to SPRAM. The corresponding node
is removed from the TRG and the conflict index for the remaining data sections is
recomputed. Note that the above step is performed for every data section identified to be
placed in SPRAM. This process is repeated either until the SPRAM space is full or until
there are no more data sections to be placed.
7.4 Cache Conscious Data Layout
7.4.1 Overview
The data partitioning step places the most conflicting data in SPRAM and thereby reduces
the possible conflict misses in the cache. However, the SPRAM size is typically very small
and only a few data sections would have been placed in SPRAM³. The remaining data
sections still need to be placed carefully in the cache to reduce the cache misses. In this
section we discuss the cache-conscious data layout.
The problem of cache-conscious data layout is to find optimal data placement in off-
chip RAM with the following objectives: (a) to reduce the number of cache misses and (b)
to reduce the address space used in off-chip RAM. In other words, the objective is to reduce
the "holes" in off-chip RAM after placement. By this, we mean that the data sections are
placed in the off-chip RAM in such a manner that the gaps between data sections, which
are left to reduce conflict misses, are minimized. These gaps lead to wasted memory space and
hence increase hardware cost. To the best of our knowledge, reducing cache misses (the
first objective) has been the sole objective targeted by all earlier data layout approaches
published [17, 22, 41, 50, 53]. But it is very important to consider objective (a) in the
context of objective (b) for the following reasons.
³As mentioned earlier, data placement within the SPRAM can be done in a subsequent phase using any of the data layout methods discussed in Chapter 3. We do not experiment with this as it has been extensively dealt with in the previous chapters.
Algorithm: SPRAM-Cache Data Partitioning

Inputs:
    N = number of data sections
    Access frequency of all data sections
    Temporal Relationship Graph (TRG)
    Spatial Locality Factor (SLF)
    Data section sizes
Output:
    List of data sections to be placed in SPRAM

begin
1.  Compute the access frequency per byte for all data sections
2.  Normalize the access frequency per byte for all data sections (nAF)
3.  for i = 0 to N−1
    3.1  compute STRG(i); sumSTRG += STRG(i);
4.  for i = 0 to N−1
    4.1  nSTRG(i) = STRG(i) / sumSTRG;
5.  for i = 0 to N−1
    5.1  compute SLF(i); sumSLF += SLF(i);
6.  for i = 0 to N−1
    6.1  nSLF(i) = SLF(i) / sumSLF;
7.  for i = 0 to N−1
    7.1  conflict-index(i) = nSTRG(i) + nAF(i) + nSLF(i);
8.  sort the data sections in descending order of conflict-index
9.  while (space is available in SPRAM and data sections remain)
    9.1  identify the data section s with the highest conflict index
    9.2  place s in SPRAM if it fits within the available space
    9.3  update the available SPRAM space to account for the above placement
    9.4  remove s from the TRG
    9.5  recompute STRG for the remaining nodes in the TRG
    9.6  recompute the conflict index with the newly updated STRG
10. exit
end
Figure 7.4: Heuristic Algorithm for Data Partitioning
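The heuristic of Figure 7.4 can be rendered as compact executable code. This is our sketch; details such as dropping a section from further consideration when it does not fit, and recomputing the conflict index from the pruned TRG on every iteration, are our reading of the algorithm.

```python
def partition_data(sections, sizes, af, slf, trg, spram_size):
    """Greedy SPRAM/cache partitioning in the spirit of Figure 7.4.
    sections: list of names; sizes/af/slf: per-section dicts;
    trg: dict mapping frozenset({x, y}) -> weight.
    Returns the list of sections chosen for SPRAM."""
    trg = dict(trg)                      # local copy; edges are pruned below
    remaining = set(sections)
    free = spram_size
    chosen = []

    def conflict_index(s):
        strg_all = {d: sum(w for e, w in trg.items() if d in e) for d in remaining}
        tot_strg = sum(strg_all.values()) or 1
        tot_af = sum(af[d] / sizes[d] for d in remaining) or 1      # per-byte AF
        tot_slf = sum(slf[d] for d in remaining) or 1
        return (strg_all[s] / tot_strg
                + (af[s] / sizes[s]) / tot_af
                + slf[s] / tot_slf)

    while free > 0 and remaining:
        s = max(remaining, key=conflict_index)
        if sizes[s] <= free:             # step 9.2: place only if it fits
            chosen.append(s)
            free -= sizes[s]
        remaining.remove(s)
        trg = {e: w for e, w in trg.items() if s not in e}   # step 9.4
    return chosen

# Example with the TRG weights of Figure 7.3 and made-up AF/SLF values.
secs = ["a", "b", "c", "d"]
sz = {s: 100 for s in secs}
af = {"a": 100, "b": 10, "c": 10, "d": 1}
slf = {"a": 4, "b": 1, "c": 1, "d": 1}
trg = {frozenset("ac"): 8, frozenset("ab"): 1, frozenset("ad"): 1,
       frozenset("bc"): 6, frozenset("cd"): 1}
print(partition_data(secs, sz, af, slf, trg, spram_size=100))   # ['a']
```

Section a dominates all three normalized factors, so it fills the 100-byte SPRAM by itself.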
• For SOC architectures with instruction cache and data cache that share the same
off-chip RAM, a data layout approach that optimizes only the data cache misses,
without considering optimization of the off-chip RAM address space, will use up too
much address space by spreading the data placement, leaving many holes. This will
place severe constraints on code placement, requiring the code to be placed across the
holes and in the remaining off-chip RAM. This may potentially result in additional
instruction cache misses. Hence, there is a chance that all the gains achieved by
optimizing the data cache misses are lost.
• A data layout approach which optimizes the data placement in off-chip RAM with-
out any holes will be independent of instruction cache placement. Hence, the ar-
chitecture exploration of data cache can be done independent of instruction cache.
For example, an application with 96K of data will have around 2700 hybrid archi-
tectures that are worth exploring. If the code placement is not independent of the
data layout and the code segments are placed in the holes created, then the memory
exploration process needs to consider both instruction and data cache configuration
together. This will increase the number of architectures considered. In such a sce-
nario, the number of architectures explored could increase to 50000+. Hence, it
is important to design a data layout algorithm that is independent of instruction
cache.
We formulate the cache conscious data layout problem as a graph partitioning prob-
lem [38]. Inputs to the data layout algorithm are (i) application’s data section sizes
and (ii) Temporal Relationship Graph. The data layout algorithm is explained in a
block diagram in Figure 7.5. The first step in the data layout problem is modelled
as a graph partitioning problem, where data sections are grouped into disjoint subsets
such that the memory requirement for the data sections in a disjoint subset is less than
the cache size. More specifically, the first step is a k-way graph partitioning, where
k = ⌈application data size / cache size⌉. The data sections in each of the partitions are
selected such that they have intervening accesses and hence can cause potential conflict
misses. Thus the output of the graph partitioning step is k partitions, with each partition
Figure 7.5: Cache Conscious Data Layout
having a set of data sections that conflict among themselves the most, with the partition
size less than the cache size. Since each of the k partitions is smaller than the cache size,
each of these partitions can be mapped into an off-chip RAM address space that corresponds
to one cache page. This step eliminates all the conflicts between data sections that are in
the same partition. The graph partitioning method is discussed in detail in Section 7.4.2.
The next step in the data layout is to minimize the possible conflicts between data-
sections that are in two different partitions. This is handled by the offset-computation
step. The details of the offset computation are presented in Section 7.4.3. Once the offset-
computation step assigns cache-block offsets to each of the data sections, the address
assignment step allocates unique off-chip addresses to all the data sections. Finally, using
the address assignment, the number of cache misses and the power consumed for cache
and off-chip memory accesses are computed, which are used for identifying the Pareto-optimal
solutions. The following subsections detail the graph partitioning heuristic and the offset
computation heuristic.
7.4.2 Graph Partitioning Formulation
In this section we explain the graph partitioning heuristic, which is a generalisation of
Kernighan-Lin [38] and operates on the temporal relationship graph for the data sections
that need to be placed in off-chip RAM. Note that this excludes all data sections that
have been mapped to SPRAM. The temporal relationship graph is G = {V, E, s, w},
where V is the set of vertices representing data sections and E is the set of edges between
pairs of data sections representing temporal access conflicts. Further, the functions s and
w are associated respectively with the nodes and edges of the TRG; s(u) represents the
size of the data section associated with a node u and w(u, v) represents the number of
temporal access conflicts between a pair of nodes u and v. The weight function w(u, v)
is the same as TRG(u, v), but restricted to the data sections that need to be assigned
to the off-chip RAM. The graph partitioning problem aims at dividing G into m disjoint
partitions. An m-way partition of G is a collection of subgraphs Gi = {Vi, Ei}, such that

• the subsets are disjoint: Vi ∩ Vj = ∅, for i ≠ j
• ⋃_{i=1}^{m} Vi = V
• every edge e = (u, v) ∈ E belongs to Ei iff u ∈ Vi and v ∈ Vi
The objective of the graph partitioning step is to group the nodes such that the sum
of the weights on the internal edges is maximized. The objective function that needs to be
maximized is given in Equation 7.5, with the constraint given in Equation 7.6.

Σ_i Σ_{e_j ∈ E_i} w(e_j)     (7.5)

Σ_{u_j ∈ V_i} s(u_j) ≤ cache-size     (7.6)
An edge e_ext = (u, v) is said to be an external edge for a partition G_i if u ∈ V_i and
v ∉ V_i; i.e., if one of the nodes connected by the edge is in partition G_i and the other is not.
Similarly, an edge e_int is said to be an internal edge if both the nodes it connects are
in the partition G_i. The sum of all the weights on the external edges of partition G_i
is referred to as the external cost, E_i = Σ_{e_ext ∈ G_i} w(e_ext). The sum of all the
weights on the internal edges of partition G_i is referred to as the internal cost,
I_i = Σ_{e_int ∈ G_i} w(e_int). The total external cost is E = Σ_i Σ_{e_ext ∈ G_i} w(e_ext).
Thus the objective of the partitioning problem is to find a partition with minimum external
cost. Alternatively, the graph partitioning problem can also be formulated as maximizing
the total internal cost, i.e., Σ_i Σ_{e_int ∈ G_i} w(e_int), subject to the constraint
Σ_{u_j ∈ G_i} s(u_j) ≤ cache-size for all G_i.
The optimal partitioning problem is NP-complete [38, 66]. There are a number of
heuristic approaches [26, 47] to this problem, including the well-known Kernighan-Lin
heuristic [38] for two partitions. We extend the heuristic proposed in [38, 66] to solve our
problem. The Kernighan-Lin heuristic aims at finding a minimal external cost partition
of a graph into two equally sized subgraphs. The heuristic achieves this by starting with
a random partition and repeatedly swapping the pair of nodes that gives the maximum
gain. The gain is computed as the difference between internal and external costs. Let us
consider two
nodes a and b present in two different subgraphs A and B respectively. We define the
external cost (ECost) of a as E_a = Σ_{x ∈ B} w(a, x) and the internal cost (ICost) of a
as I_a = Σ_{y ∈ A} w(a, y), for each a ∈ A. Similarly, the ECost and ICost of b are defined
as E_b and I_b respectively. Let D_a = E_a − I_a be the difference between ECost and ICost
for each a ∈ A. A result proved by Kernighan and Lin [38] shows that for any a ∈ A and
b ∈ B, if they are interchanged, the reduction in partitioning cost is given by
R_ab = D_a + D_b − 2·w(a, b). The nodes a and b are interchanged to partitions B and A
respectively if R_ab > 0.
In [66], the graph partitioning heuristic is generalized to an m-way partition. It starts
with a random set of m partitions and picks any two of the partitions and applies the
Kernighan-Lin heuristic repeatedly on this pair until no more profitable exchanges are
possible. Then these two partitions are marked as pair-wise optimal. The algorithm then
picks two other partitions to apply the heuristic. This process is repeated until all the
partitions are pair-wise optimal.
We have adapted the algorithm of [66] and added additional constraints to make it
work for our problem. The main constraints are as below:

1. Σ_{a ∈ V_i} s(a) ≤ cache-size for all partitions;
2. if a data-section size s(a) > cache-size, then this data section is placed alone in a
partition and marked optimal; and
3. nodes a and b are interchanged to partitions B and A respectively only if R_ab > 0
and if Σ_{a ∈ A} s(a) ≤ cache-size and Σ_{b ∈ B} s(b) ≤ cache-size.
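One pairwise improvement pass with these constraints can be sketched as follows. This is our reconstruction, and checking the partition sizes after the prospective swap is our interpretation of constraint 3:

```python
def improve_pair(A, B, w, size, cache_size):
    """Repeatedly swap the best (a, b) pair between partitions A and B
    (sets of nodes) while the gain R_ab = D_a + D_b - 2*w(a, b) is
    positive and both partitions stay within the cache size."""
    def D(n, own, other):
        ext = sum(w(n, x) for x in other)              # E_n
        internal = sum(w(n, y) for y in own if y != n) # I_n
        return ext - internal

    improved = True
    while improved:
        improved = False
        best = None
        for a in A:
            for b in B:
                gain = D(a, A, B) + D(b, B, A) - 2 * w(a, b)
                size_ok = (sum(size[x] for x in A) - size[a] + size[b] <= cache_size
                           and sum(size[x] for x in B) - size[b] + size[a] <= cache_size)
                if gain > 0 and size_ok and (best is None or gain > best[0]):
                    best = (gain, a, b)
        if best:
            _, a, b = best
            A.remove(a); B.remove(b)
            A.add(b); B.add(a)
            improved = True
    return A, B

# Toy graph: heavy edges 1-2 and 3-4; starting from {1,3}/{2,4} the
# pass regroups the heavily conflicting pairs into the same partition.
W = {frozenset((1, 2)): 10, frozenset((3, 4)): 10, frozenset((1, 3)): 1}
wf = lambda x, y: W.get(frozenset((x, y)), 0)
A, B = improve_pair({1, 3}, {2, 4}, wf, {1: 1, 2: 1, 3: 1, 4: 1}, cache_size=10)
# partitions become {1, 2} and {3, 4} (in either order)
```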
The output of the graph partitioning step is a collection of subgraphs that maximizes
the internal cost, minimizes the external cost and ensures that no partition has a size
larger than the cache size⁴. Thus, each of the partitions can be placed in an off-chip RAM
address space that maps to a cache page, such that none of the data sections that are part
of the same partition will conflict in the cache. We are now left with optimizing the cache
conflicts that might arise between data sections belonging to two different partitions.
Since the external cost is already minimized, the number of such conflicts will already be
very small. The offset computation step, described in the following subsection, aims at
reducing conflicts caused by data sections belonging to different partitions.
7.4.3 Cache Offset Computation
The cache offset computation step aims at reducing cache conflict misses between data
sections that are part of two different partitions. Each partition is placed in the off-
chip RAM address space that corresponds to one cache page. It may be noted that the
ordering of the partitions does not have any impact on the cache misses. For each of the
data sections in a partition, a cache-block offset needs to be assigned which in turn is
used to determine a unique off-chip memory address for the data section.
⁴Obviously, a partition containing a data section whose size is larger than the cache size will not obey this property. But such a data section can be considered to form l = ⌈data section size / cache size⌉ consecutive partitions, each less than or equal to the cache size.
Algorithm: Offset Computation Heuristic

Inputs:
    TRGblk values for all the data blocks
    External costs for all the partitions (E_i)
    Internal costs for all the partitions (I_i)
    External costs E_i,uj for each node uj in a partition Gi
    Cache configuration
    Data section sizes
Output:
    Offsets assigned to each of the data sections

begin
1.  Sort the partitions in decreasing order of external cost
2.  for i = 1 to k partitions
    2.1  pick the partition Gi with the highest external cost
    2.2  sort the data sections in descending order of the external cost E_i,uj
    2.3  for all data sections in Gi
         2.3.1  pick the data section uj with the highest E_i,uj
         2.3.2  evaluate the placement cost of placing uj in each of the
                available cache lines for the target cache configuration:
                2.3.2.1  place uj in an available cache line, with the constraint
                         that the data section must be placed contiguously
                2.3.2.2  compute the cost of placement using the TRGblk
                         information for all the data blocks already placed
                2.3.2.3  store the cost of placement C_l for the cache line l
                2.3.2.4  repeat the last three steps for all possible cache lines
         2.3.3  find the cache line l that gives the minimal cost
         2.3.4  assign l as the starting point for uj
         2.3.5  mark the cache lines from l to l + size(uj)/block-size as
                not available for other data sections in Gi
    2.4  end for
3.  end for
4.  placement complete
end
Figure 7.6: Heuristic Algorithm for Offset Computation
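The inner cost-evaluation loop of the offset computation can be sketched as follows. The simplified cost model (summing TRGblk weights between a block and the blocks already occupying its cache line) and all names are ours, not the thesis code:

```python
def placement_cost(section_blocks, start_line, occupied, trg_blk, n_lines):
    """Cost of placing a section whose block symbols are `section_blocks`
    starting at cache line `start_line`: the sum of TRGblk weights
    between each block and the already-placed blocks that map to the
    same cache line (modulo the number of lines)."""
    cost = 0
    for i, blk in enumerate(section_blocks):
        line = (start_line + i) % n_lines
        for other in occupied.get(line, []):
            cost += trg_blk.get(frozenset((blk, other)), 0)
    return cost

def best_offset(section_blocks, free_lines, occupied, trg_blk, n_lines):
    """Try every allowable starting line (enough contiguous free lines)
    and return the one with minimal placement cost."""
    need = len(section_blocks)
    candidates = [l for l in sorted(free_lines)
                  if all(((l + i) % n_lines) in free_lines for i in range(need))]
    return min(candidates,
               key=lambda l: placement_cost(section_blocks, l, occupied,
                                            trg_blk, n_lines))

# a1 was already placed at line 1; TRGblk(a1, b0) = 5, so b avoids line 1.
trg_blk = {frozenset(("a1", "b0")): 5}
print(best_offset(["b0", "b1"], {0, 1, 2, 3}, {1: ["a1"]}, trg_blk, 4))   # 0
```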
To decide the offset that gives the least number of conflicts, we compute the placement
cost for all possible placements of the data section inside a cache page. To compute the
placement cost, we use a fine-grained version of the TRG. Note that the TRG computed in
Section 7.3 is at the granularity of data sections. But to determine at which offset to place
a data section, the temporal access pattern needs to be computed at a finer granularity.
We illustrate these ideas with the help of an example.
Let there be 2 data sections a and b of size 128 bytes and 64 bytes respectively.
Consider the following access pattern: a[0]b[0]a[60]b[1]a[61]b[2]a[62]b[3]
For this access pattern, TRG(a, b) is 6, as explained in Section 7.3; i.e., data sections
a and b are accessed 6 times in an interleaved manner. However, for a direct-mapped
cache of size 4KB with a 32-byte block size, placing a at address k and b at off-chip
address k + 4KB will not result in any conflict misses, even though TRG(a, b) = 6. This
is because a[60], a[61] and a[62] map to a cache line (C + 1), while a[0], b[0], b[1], b[2]
and b[3] map to cache line C. However, if a is placed at address k and b is placed at
address k + 4KB + 32B, then the pattern results in 5 conflict misses.
Hence, to determine the cost, in terms of conflict misses, of placing a data section, the
TRG values are needed at a finer granularity. For the above example, if we keep the
granularity at 1 cache block, then data section a is divided into 4 data blocks and data
section b is divided into 2 data blocks. We define a new term, TRGblk, that represents
the temporal access pattern among data blocks. This is similar to the approach described
in [17]. The above access sequence results in a0, b0, a1, b0, a1, b0, a1, b0, where a0 and
a1 represent the first two (cache-block-sized) blocks of data section a, and b0 represents
the first block of data section b. For the above example, TRGblk will consist of the nodes
a0, a1, a2, a3, b0 and b1. For the access pattern given above, TRGblk(a1, b0) = 5 and all
other TRGblk values are 0. We use the TRGblk values to compute the cost of placement,
C(s, l), for a data section s at a cache offset l.
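TRGblk is the same interleaving count as the TRG of Section 7.3, applied to block-level symbols; a sketch reproducing the example above (the (section, byte-index) encoding of accesses is our own):

```python
from collections import defaultdict

def block_trace(accesses, block_size):
    """Map (section, byte_index) accesses to block-level symbols,
    e.g. ('a', 60) -> 'a1' for 32-byte blocks."""
    return [f"{sec}{idx // block_size}" for sec, idx in accesses]

def trg_of(trace):
    """Interleaving count (as in Section 7.3), here at block granularity."""
    trg, last = defaultdict(int), {}
    for i, x in enumerate(trace):
        if x in last:
            for y in set(trace[last[x] + 1 : i]) - {x}:
                trg[frozenset((x, y))] += 1
        last[x] = i
    return trg

acc = [("a", 0), ("b", 0), ("a", 60), ("b", 1),
       ("a", 61), ("b", 2), ("a", 62), ("b", 3)]
bt = block_trace(acc, 32)
print(bt)                                    # ['a0', 'b0', 'a1', 'b0', 'a1', 'b0', 'a1', 'b0']
print(trg_of(bt)[frozenset(("a1", "b0"))])   # 5
```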
The offset computation algorithm is shown in Figure 7.6. To begin with, the
partitions are ordered based on their total external cost (E_i). The partition Gi with the
highest external cost is selected first for offset computation. Data sections that are part
of partition Gi are ordered by the external cost of the corresponding nodes in Gi.
The data section uj with the highest external cost (Ei,uj) is taken up first for offset
computation. Data section uj is tentatively placed at each of the allowable cache lines
and the placement cost is computed with the help of TRGblk. Here, by allowable, we
mean that there are enough contiguous free cache lines in a cache page to accommodate
the data section uj. For example, if the data section size is 128 bytes and the cache block
size is 32 bytes, then a feasible cache line means 4 contiguous lines are free. Note that
at this point no offset is assigned to the data section uj. The cost of placement C(uj, l)
for data section uj is computed for every allowable cache line l from 1 to Nl, where Nl
is the total number of cache lines. The cache line l with the minimum cost is assigned
to data section uj, and the cache lines from l to l + size(uj)/line-size - 1 are marked
as full so that they are not available to any other data section in Gi. This restriction
ensures that the cache offsets for all data sections in a partition Gi are assigned within
one cache page, which in turn keeps the amount of external address space used close to
the application data size. The above process is repeated for all data sections in partition
Gi. Then the next partition Gi+1 with the highest external cost is selected for offset
computation. This process continues until all partitions are handled.
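The greedy procedure just described can be sketched as follows. The function names, the data-structure shapes, and the placement-cost callback are illustrative assumptions; the algorithm in Figure 7.6 is the authoritative version.

```python
def assign_offsets(partitions, ext_cost, section_cost, size_in_lines, num_lines):
    """Greedy cache-offset assignment, sketched from the text above.

    partitions:    list of partitions Gi, each a list of data-section names
    ext_cost:      dict mapping a section to its external cost E_{i,u}
    section_cost:  callback (section, line, placed) -> placement cost C(s, l),
                   assumed to be derived from the TRGblk weights
    size_in_lines: dict mapping a section to the contiguous cache lines it needs
    num_lines:     total number of cache lines in a cache page (Nl)
    Returns a dict mapping each section to its assigned starting cache line.
    """
    offsets = {}
    # Process partitions in decreasing order of total external cost Ei.
    for part in sorted(partitions,
                       key=lambda p: sum(ext_cost[s] for s in p), reverse=True):
        free = [True] * num_lines  # line availability within this partition's page
        # Process sections in decreasing order of external cost E_{i,u}.
        for sec in sorted(part, key=lambda s: ext_cost[s], reverse=True):
            need = size_in_lines[sec]
            best_line, best_cost = None, None
            for line in range(num_lines - need + 1):
                # "Allowable" placement: enough contiguous free lines.
                if all(free[line:line + need]):
                    c = section_cost(sec, line, offsets)
                    if best_cost is None or c < best_cost:
                        best_line, best_cost = line, c
            if best_line is None:
                raise ValueError(f"no allowable cache lines for section {sec}")
            offsets[sec] = best_line
            # Mark the occupied lines full for the rest of partition Gi.
            for l in range(best_line, best_line + need):
                free[l] = False
    return offsets

# Toy usage: one partition with two sections; the cost callback simply
# prefers lower line numbers, standing in for a TRGblk-derived cost.
offsets = assign_offsets(
    partitions=[["u", "v"]],
    ext_cost={"u": 5, "v": 3},
    section_cost=lambda s, l, placed: l,
    size_in_lines={"u": 2, "v": 1},
    num_lines=4,
)
```

Here section u (higher external cost) is placed first and occupies two lines, so v is pushed to the next free line; both stay within the same cache page, matching the restriction described above.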
7.5 Experimental Methodology and Results
7.5.1 Experimental Methodology
We have used Texas Instruments' TMS320C64x processor for our experiments. This pro-
cessor has a 16KB data cache, and we have used Texas Instruments' Code Composer Studio
(CCS) environment for obtaining profile data and data memory address traces, and for
validating data-layout placements. We have used 3 different applications - AAC (Advanced
Audio Coding), MPEG video encoder and JPEG image compression from MediaBench
[43] - for performing the experiments. We compute the TRG, sumtrg, and the spatial locality
factor from the data memory address traces obtained from CCS. We used eCACTI
[45] to obtain the area and power numbers for different cache configurations. First, we
report experimental results demonstrating the benefits of our cache-conscious data layout
method. Subsequently, in Section 7.5.4, we report the results pertaining to cache-SPRAM
memory architecture exploration.
7.5.2 Cache-Conscious Data Layout
In this section we present results of our cache-conscious data layout and compare them
with the approach proposed by Calder [17]. We have used the above 3 MediaBench
applications and 4 different cache sizes. In this experiment, for all the cache sizes we
have used a 32-byte cache-block size and a direct-mapped cache configuration. Table 7.2
presents the results of the data layout. Column 4 in Table 7.2 gives the number of
cache misses incurred when the data-layout approach of [17] is used, and Column 5
gives the number of cache misses incurred when our data-layout approach is applied. Our
approach performs consistently better and reduces the number of cache misses, especially
for AAC and MPEG. Our method achieves up to a 34% reduction in cache misses (for
AAC with a 16KB cache). Further, our approach consumes an off-chip memory address
space that is very close to the application data size; this follows by construction from the
graph-partitioning approach and from avoiding gaps during data layout, as explained in
Section 7.4. In contrast, Calder's approach [17] consumes 1.5 to 2.6 times the application
data size in the off-chip address space to achieve the performance given in Table 7.2. This
is a significant advantage of our approach, as increased off-chip address space implies
increased memory cost for the SoC.
In Table 7.3, we present the results of our approach for different cache configurations
(direct-mapped, 2-way and 4-way set-associative caches). Note that these experiments
are performed with a cache-only architecture and no SPRAM. Observe that for all the
applications, the reduction in misses is significant for 2-way and 4-way set-associative
caches. However, for the 4KB cache configuration for MPEG, the reduction in cache
misses is modest. This is due to the large data-set (footprint) requirement of MPEG.
Also, observe that the data set for JPEG is much smaller, and hence a direct-mapped 16KB
cache or a 4-way set-associative 8KB cache could resolve most of the conflict misses.
Table 7.2: Data Layout Comparison
Application | Cache Size | Number of memory accesses | Cache misses: Calder [17] | Cache misses: Graph-partition | Improvement