On-Chip Memory Architecture Exploration of
Embedded System on Chip
A Thesis
Submitted for the Degree of
Doctor of Philosophy
in the Faculty of Engineering
by
T.S. Rajesh Kumar
Supercomputer Education and Research Centre
Indian Institute of Science
Bangalore – 560 012
September 2008
To my Family, Sree, Amma, Advika and Adarsh
Abstract
Today’s feature-rich multimedia products require embedded system solutions with complex System-on-Chips (SoCs) to meet market expectations of high performance at low cost and low energy consumption. SoCs are complex designs with multiple embedded processors,
memory subsystems, and application specific peripherals. The memory architecture of
embedded SoCs strongly influences the area, power and performance of the entire system.
Further, the memory subsystem constitutes a major part (typically up to 70%) of the
silicon area of a current-day SoC.
The on-chip memory organization of embedded processors varies widely from one
SoC to another, depending on the application and market segment for which the SoC is
deployed. There is a wide variety of choices available for the embedded designers, starting
from simple on-chip SPRAM based architecture to more complex cache-SPRAM based
hybrid architecture. The performance of a memory architecture also depends on how
the data variables of the application are placed in the memory. There are multiple data
layouts for each memory architecture that are efficient from a power and performance
viewpoint. Further, the designer would be interested in multiple optimal design points
to address various market segments. Hence, memory architecture exploration for an embedded system involves evaluating a large design space, on the order of 100,000 design points, with each design point having several tens of thousands of possible data layouts.
Due to its large impact on system performance parameters, the memory architecture is
often hand-crafted by experienced designers exploring a very small subset of this design
space. The vast memory design space precludes any possibility of a complete manual analysis.
In this work, we propose an automated framework for on-chip memory architecture
exploration. Our proposed framework integrates memory architecture exploration and
data layout to search the design space efficiently. While the memory exploration selects
specific memory architectures, the data layout efficiently maps the given application onto the memory architecture under consideration and thus helps in evaluating the memory
architecture. The proposed memory exploration framework works at both logical and
physical memory architecture levels. Our work addresses on-chip memory architectures for DSP processors that are organized as multiple memory banks, where each bank can be a single- or dual-port bank and the bank sizes can be non-uniform. Further, our work also addresses memory architecture exploration for on-chip memory architectures that combine SPRAM and cache. Our proposed method is based on a multi-objective Genetic Algorithm and outputs several hundred Pareto-optimal design solutions that are interesting from area, power and performance viewpoints, within a few hours of running on a standard desktop configuration.
Acknowledgments
There are many people I would like to thank who have helped me in various ways.
First and foremost I would like to thank my Supervisors, Prof. R. Govindarajan and
Dr. C.P. Ravikumar, who have guided me and supported me in various aspects through the
entire journey toward completing my thesis work. I profusely thank them for the encouragement
they provided and their perseverance in keeping me focused on the Ph.D. work.
I would like to express my gratitude to Texas Instruments for giving me the time
and opportunity to pursue my studies. I would like to thank my colleagues at Texas
Instruments for their support and reviews, in particular my manager Balaji Holur.
I would also like to thank my previous managers Pamela Kumar and Manohar Sambandam.
Last but not the least, I would like to thank my dearest family members for the
encouragement they provided and the sacrifices they made to help me achieve my goals.
List of Publications from this Thesis

1. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. On-chip Memory Architecture
Exploration Framework for DSP Processor Based Embedded SoC. Submitted to the ACM
Transactions on Embedded Computing Systems, May 2008.
2. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. Memory Architecture Explo-
ration Framework for Cache-based Embedded SoC. In Proceedings of the International
Conference on VLSI Design, Jan 2008.
3. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. MODLEX: A Multi-Objective
Data Layout EXploration Framework for Embedded SoC. In Proceedings of the 12th Asia
and South Pacific Design Automation Conference (ASP-DAC), Jan 2007.
4. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. MAX: A Multi-Objective
Memory Architecture Exploration Framework for Embedded SoC. In Proceedings of the
International Conference on VLSI Design, Jan 2007.
5. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. Embedded Tutorial on Multi-
Processor Architectures for Embedded SoC. In Proceedings of the VLSI Design and Test,
Aug 2003.
6. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. Optimal Code and Data Lay-
out for Embedded Systems. In Proceedings of the International Conference on VLSI
Design, Jan 2003.
7. T.S.Rajesh Kumar, R.Govindarajan, and C.P. Ravikumar. Memory Exploration for Em-
bedded Systems. In Proceedings of the VLSI Design and Test, Aug 2002.
Chapter 1
Introduction
1.1 Application Specific Systems
Today’s VLSI technology allows us to integrate tens of processor cores on the same chip
along with embedded memories, application specific circuits, and interconnect infrastruc-
ture. As a result, it is possible to integrate an entire system onto a single chip. The single
chip phone, which has been introduced by several semiconductor vendors, is an example of
such a system-on-chip; it includes the modem, radio transceiver, power management func-
tionality, a multimedia engine and security features, all on the same chip. An embedded
system is an application-specific system which is optimized to perform a single function
or a small set of functions [70]. We distinguish this from a general-purpose system, which
is software-programmable to perform multiple functions. A personal computer is an ex-
ample of a general-purpose system; depending on the software we run on the computer,
it can be useful for playing games, word processing, database operations, scientific com-
putation, etc. On the other hand, a digital camera is an example of an embedded system,
which can perform a limited set of functions such as taking pictures, organizing them, or
transferring them to another device through a suitable I/O interface. Other examples of
embedded systems include mobile phones, audio/video players, videogame consoles, set-
top boxes, car infotainment systems, personal digital assistants, telephone central-office
switches, dedicated network routers and bridges. Note that a large number of embedded
systems are built for the consumer market. As a result, in order to be competitive, the
cost of an embedded system cannot be very high. Yet, the consumers demand higher per-
formance and more features from the embedded systems products. It is easy to appreciate
this point if we compare the performance and feature set offered by mobile phones that cost Rs. 5000 (about $100) today with those of phones that cost the same a few years ago. We also see that
a large number of embedded systems are being built for the mobile market. This trend
is not surprising - the number of mobile phone subscribers increased from 500 Million in
year 2000 to 2.6 Billion in 2007 [7]. Because of such high volumes, embedded systems are
extremely cost sensitive and their design demands careful silicon-area optimization. Since
mobile devices use batteries as the main source of power, embedded systems must also be
optimized for energy dissipation. Power, which represents the rate at which energy is con-
sumed, must also be kept low to avoid heating and to improve reliability. In summary, the
designer of an embedded system must simultaneously consider and optimize price, perfor-
mance, energy, and power dissipation. Application specific embedded systems designed
today demand innovative methods to optimize these system cost functions [11, 19].
Many of today’s embedded systems are based on system-on-chip platforms [16], which,
in turn, consist of one or more embedded microcontrollers, digital signal processors (DSP),
application specific circuits and read-only memory, all integrated into a single package.
These blocks are available from vendors of intellectual property (IP) as hard cores or soft
cores [42, 28]. A hard core, or hard IP block, is one where the circuit is available at a
lower level of abstraction such as the layout-level [42, 28]; it is impossible to customize a
hard IP to suit the requirements of the embedded system. As a result, there are limited
opportunities in optimizing the cost functions by modifying the hard IP. For example, if
some functionality included in the IP is not required in the present application, we cannot
remove the function to save area. Soft IP refers to circuits which are available at a higher
level of abstraction, such as register-transfer level [28, 42]. It is possible to customize the
soft IP for the specific application. The designer of an embedded SoC integrates the IP
cores for processors, memories, and application-specific hardware to create the SoC.
Figure 1.1 illustrates the architecture of an embedded system-on-chip (SoC). As can
be seen in the figure, there are four principal components in such an SoC.
1. An Analog Front End, which includes the analog/digital and digital/analog converters.
2. Programmable Components which include microprocessors, microcontrollers, and
DSPs. The number of embedded processors is increasing every year. An interesting
statistic shows that of the nine billion processors manufactured in 2005, less than 2%
were used for general-purpose computers. The other 8.8 billion went into embedded
systems [13]. The microcontroller/microprocessor is useful in handling interrupts,
house-keeping and performing timing related functions. The DSP is useful for pro-
cessing the audio and video information e.g., compression and decompression of
audio and video information. The application software is normally preloaded in
the memory and is not user programmable, unlike general-purpose processor-based systems.
3. Application-specific components – these include hardware accelerators for compute-intensive functions. Examples of hardware accelerators include digital image processors which are useful in cameras.

4. Memory subsystem – the on-chip and off-chip memories that hold the application code and data; this component is discussed in detail in the next section.
1.2 Memory Subsystem
1.2.1 On-chip Memory Organization
The memory architecture of an embedded processor core is complex and is custom de-
signed to improve run-time performance and power consumption. In this section we describe only the memory architecture of the DSP processor, as this is the focus of the thesis. The memory architecture of a DSP is more complex than
that of microcontrollers (MCU) due to the following reasons: (a) DSP applications are
more data dominated than the control-dominated software executed on an MCU. Mem-
ory bandwidth requirements for DSP applications range from 2 to 3 memory accesses per
Figure 1.1: Architecture of an Embedded SoC
processor clock cycle. For an MCU, this figure is, at best, one memory access per cycle.
(b) It is critical in DSP application to extract maximum performance from the memory
subsystem in order to meet the real-time constraints of the embedded application. As a
consequence, the DSP software for critical kernels is developed mostly as hand-optimized
assembly code. In contrast, the software for MCU is typically developed in high-level
languages. The memory architecture for a DSP is unique since the DSP has multiple on-
chip buses and multiple address generation units to service higher bandwidth needs. The
on-chip memory of embedded processors can include (a) only Level-1 cache (L1-cache)
(e.g., [1]), (b) only scratch-pad RAM (SPRAM) (e.g., [75, 76]), or (c) a combination of
L1-cache and SPRAM (e.g., [2, 77]).
1.2.2 Cache-based Memory Organization
Purely cache-based on-chip memory organization is generally not preferred by embedded
system designers as this organization cannot guarantee the worst-case execution time
constraints. This is because the access time in a cache based system can vary depending
on whether the access results in a cache miss or a hit [33]. As a consequence, the run-time
performance of cache-based memory subsystems varies based on the execution path of the application and is data dependent. However, a cache architecture is advantageous in the sense that it reduces the programmer’s responsibility for placing data to achieve better memory access times. Further, the movement of data from off-chip memory to cache is transparent to the programmer. In [12], the authors present a comparison study of SPRAM and cache for
embedded applications and conclude that SPRAM has 34% smaller area and 40% lower
power consumption than a cache of the same capacity. There is published literature to
estimate the worst case execution time [81] and find an upper bound on run-time [78]
for cache-based embedded systems. Hence, it has been argued that for real-time embedded systems which require stringent worst-case performance guarantees, a purely cache-based on-chip organization is not suitable.
1.2.3 Scratch Pad Memory-based Organization
On-chip memory organization based only on Scratch Pad memory ensures single cycle
access times and guarantees on worst-case execution for data that resides in Scratch-Pad
RAM (SPRAM). However, it is the responsibility of the programmer to identify the data sections that should be placed in SPRAM, or to place code in the program to appropriately move data from off-chip memory to SPRAM. A DSP core can include the following types of memories: static RAM (SRAM), ROM, and/or dynamic RAM (DRAM). The scratch pad
memory in the DSP core is organized into multiple memory banks to facilitate multiple
simultaneous data accesses. A memory bank can be organized as a single-access RAM
(SARAM) or a dual-access RAM (DARAM) to provide single or dual access to the memory
bank in a single cycle. Also, the on-chip memory banks can be of different sizes; smaller memory banks consume less power per access than larger ones. The embedded
system may also be interfaced to off-chip memory, which can include SRAM and DRAM.
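As a concrete (and purely illustrative) view of the two bank types, the sketch below models the cycles needed to service a burst of accesses to a single bank. The one-access-per-cycle versus two-accesses-per-cycle behavior is the defining property of SARAM and DARAM; the function and names are our own, not taken from any specific DSP:

```python
import math

# Illustrative model (assumed figures, not from a specific DSP):
# a SARAM bank services 1 access per cycle, a DARAM bank services 2.
ACCESSES_PER_CYCLE = {"SARAM": 1, "DARAM": 2}

def bank_cycles(bank_type: str, n_accesses: int) -> int:
    """Cycles needed to service n_accesses directed at one bank."""
    return math.ceil(n_accesses / ACCESSES_PER_CYCLE[bank_type])

# Two accesses issued in the same cycle: a DARAM bank absorbs both,
# while a SARAM bank needs an extra cycle for the second access.
print(bank_cycles("SARAM", 2))  # 2
print(bank_cycles("DARAM", 2))  # 1
```

This is why placing two data items that are accessed together in the same single-access bank costs stall cycles, a point that recurs in the data layout discussion below.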
Purely SPRAM based on-chip organization is suitable only for embedded applications of low to medium complexity. SPRAM based systems do not use the on-chip RAM efficiently, as they require the entire data sections that are currently accessed to be placed exclusively
in the SPRAM. It is possible to accommodate different data sections in SPRAM at dif-
ferent points in execution time by moving data dynamically between off-chip memory
and SPRAM. But this results in a certain run-time overhead and an increase in code size. For medium to large applications, which have a large number of critical data variables, a large amount of on-chip RAM becomes necessary to meet the real-time performance constraints. Hence, for such applications, pure SPRAM architectures are not preferred.
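The overlay idea mentioned above reduces to a simple lifetime check: two data sections may share the same SPRAM address range only if their live ranges never overlap. A minimal sketch, with entirely hypothetical section names and lifetimes:

```python
# Hypothetical data sections with (start, end) live ranges in abstract
# "time units" (e.g., phases of the application). Two sections can be
# overlaid in the same SPRAM region only if their live ranges are disjoint.
lifetimes = {
    "fir_coeffs":   (0, 40),   # live only during the filtering phase
    "fft_twiddles": (50, 90),  # live only during the transform phase
    "io_buffer":    (0, 90),   # live throughout
}

def can_overlay(a: str, b: str) -> bool:
    (s1, e1), (s2, e2) = lifetimes[a], lifetimes[b]
    return e1 < s2 or e2 < s1   # disjoint live ranges

print(can_overlay("fir_coeffs", "fft_twiddles"))  # True: may share SPRAM space
print(can_overlay("fir_coeffs", "io_buffer"))     # False: lifetimes overlap
```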
1.3 Data Layout
To efficiently use the on-chip memory, critical data variables of the application need to be
identified and mapped to the on-chip RAM. The memory architecture may contain both
on-chip cache and SPRAM. In such a case it is important to partition the data sections and
assign them appropriately to on-chip cache and SPRAM such that memory performance
of the application is optimized. Further, among the data sections assigned to on-chip
cache and SPRAM, a proper placement of the data sections on the cache and SPRAM
is required to ensure that the cache misses are reduced and the multiple memory banks
of the SPRAM and the dual ported SPRAMs are efficiently utilized. Identifying such a
data placement for data sections, referred to as the data layout problem, is a complex and critical step [10, 53]. This task is typically performed manually, as the compiler cannot
assume that the code under compilation represents the entire system [10].
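As a first intuition for the problem (this is not the formulation used in this thesis, which is developed in later chapters), a naive layout heuristic might simply place the most frequently accessed sections on-chip until the SPRAM is full. All names, sizes and access counts below are invented for illustration:

```python
# Naive illustration of data layout: greedily place the data sections with
# the highest access density (accesses per byte) into on-chip SPRAM until
# its capacity is exhausted. This ignores bank conflicts and overlays,
# which the real data layout problem must also handle.
SPRAM_CAPACITY = 16 * 1024  # bytes (assumed)

# (section name, size in bytes, accesses per frame) - hypothetical profile
sections = [
    ("speech_frame", 4096, 90000),
    ("fft_buffer",   8192, 60000),
    ("history",      8192, 20000),
    ("tables",       2048,  6000),
]

def greedy_layout(sections, capacity):
    onchip, offchip, used = [], [], 0
    for name, size, accesses in sorted(sections,
                                       key=lambda s: s[2] / s[1], reverse=True):
        if used + size <= capacity:
            onchip.append(name)
            used += size
        else:
            offchip.append(name)
    return onchip, offchip

onchip, offchip = greedy_layout(sections, SPRAM_CAPACITY)
print(onchip)   # hottest sections that fit on-chip
print(offchip)  # the rest spill to off-chip memory
```

A greedy pass of this kind captures only the on-chip/off-chip partitioning aspect; the conflict and overlay aspects make the real problem substantially harder.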
The application program in a modern embedded system is complex since it must
support a variety of device interfaces such as networking interfaces, credit card readers,
USB interfaces, parallel ports, and so on. The application also has many multimedia
components like MP3, AAC and MIDI [8]. This necessitates an IP reuse methodology
[74], where software modules developed and optimized independently by different vendors
are integrated. Figure 1.2 explains the typical flow in embedded application develop-
ment. This integration is a very challenging job with multiple objectives: (a) it has to
be done under tight time-to-market constraints, (b) it has to be repeated
for different variants of SoCs with different custom memory architectures, and (c) it has
to perform in such a way that the embedded application is optimized for performance,
power consumption and cost.
Figure 1.2: Embedded Application Development Flow
Since the IPs/modules are independently optimized, the integrator is under pressure
to deliver the complete product with the expectation that each component performs at the
same level as it did in isolation. This is a major challenge. When a module is optimized
independently, the developer has all the resources of the SoC (MIPS and Memory) to
optimize the module. When these modules are integrated at the system-level, the system
resources are shared among the modules. So the application integrator needs to know
the MIPS and memory requirements of the modules unambiguously to be able to allocate
the shared resources to critical needs [74]. Usually, the modules’ memory requirements
are given only at a high level. To be able to optimize the whole application/system, the
integrator will need detailed memory analysis at the module-level; e.g., which data buffers
need to be placed in dual ported memories and which data buffers should not be placed
in the same memory bank – this data is usually not available. Moreover, the critical code
is usually written in low-level assembly language to meet real-time constraints and/or
due to legacy reasons. For the above reasons, application integration and optimization (analyzing the application and mapping software modules to obtain optimal cost and performance) takes a significant amount of time, approximately 1-2 man-months. Currently, in most SoC designs, data layout is also performed manually, which has two major problems: (1) the development time is significant, which is not acceptable for current-day time-to-market requirements, and (2) the quality of the solution varies with the expertise of the designer.
1.4 Memory Architecture Exploration
In modern embedded systems, the area and power consumed by the memory subsystem is
up to 10 times that of the data path, making memory a critical component of the design
[11]. Further, the memory subsystem constitutes a large part (typically up to 70%) of
the silicon area of a current-day SoC, and this is expected to go up to 94% by 2014, as
shown in Figure 1.3 [6]. The main reason for this is that embedded memory has a relatively small per-area design cost in terms of man-power, time-to-market and power consumption [60]. Hence memory plays an important role in the
design of embedded SoCs. Further the memory architecture strongly influences the cost,
performance and power dissipation of an embedded SoC.
As discussed earlier, the on-chip memory organization of embedded processors varies
widely from one SoC to another, depending on the application and market segment for
which the SoC is deployed. There is a wide variety of choices available for the embed-
ded designers, starting from simple on-chip SPRAM based architecture to more complex
cache-SPRAM based hybrid architecture. To begin with, the system designer needs to
decide whether the SoC requires a cache and what the right size of the on-chip RAM is. Once the high
level memory organization is decided, the finer parameters need to be defined to complete
the memory architecture definition. For the on-chip SPRAM based architecture, the pa-
rameters, namely, size, latency, number of memory banks, number of read/write ports per
memory bank and connectivity, collectively define the memory organization and strongly
influence the performance, cost, and power consumption. For cache based on-chip RAM,
Figure 1.3: Memory Trends in SoC
the finer parameters are the size of cache, associativity, line size, miss latency and write
policy. Due to its large impact on system performance parameters, the memory architec-
ture is often hand-crafted by the designer based on the targeted applications. However,
with the combination of on-chip SPRAM and cache, the memory design space is too large
for a manual analysis [31]. Also, with the projected growth in the complexity of embed-
ded systems and the vast design space in memory architecture, hand optimization of the
memory architecture will soon become impossible. This warrants an automated frame-
work which can explore the memory architecture design space and identify interesting
design points that are optimal from a performance, power consumption and VLSI area
(and hence cost) perspective. As the memory architecture design space itself is vast, a
brute force design space exploration tool may take a prohibitively long computation time and hence is
unlikely to be useful in meeting the tight time-to-market constraint. Further, for each
given memory architecture, there are several possible data section layouts which are opti-
mal in terms of performance and power. This further compounds the memory architecture
exploration problem.
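To get a feel for the scale, a back-of-the-envelope enumeration over plausible parameter ranges is shown below; the ranges are illustrative assumptions of ours, not the ones used later in the thesis:

```python
from itertools import product

# Illustrative parameter ranges for a hybrid cache + SPRAM architecture.
cache_sizes   = [0, 2048, 4096, 8192, 16384, 32768]   # bytes; 0 = no cache
associativity = [1, 2, 4]
line_sizes    = [16, 32, 64]                           # bytes
spram_sizes   = [8192, 16384, 32768, 65536]            # bytes
num_banks     = [1, 2, 4, 8]
bank_ports    = [1, 2]                                 # single vs dual ported

design_points = list(product(cache_sizes, associativity, line_sizes,
                             spram_sizes, num_banks, bank_ports))
print(len(design_points))  # 6*3*3*4*4*2 = 1728 even for these coarse ranges
```

With finer size steps and independent per-bank size and port choices, the product quickly reaches the order of 100,000 design points cited earlier, and each point still has to be evaluated under many candidate data layouts.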
1.5 Embedded System Design Flow
In this section, we present our view of embedded system design flow to set the context
for our work. For this purpose, we introduce the notion of the X-chart, which is inspired by the well-known Y-chart introduced by Gajski to capture the process of VLSI system
design [29].
In a Y-chart, the three levels of design abstraction form the three dimensions of the
figure Y; these are (a) design behavior, (b) design structure and (c) physical aspects of
the design. A design flow starts from a behavior specification, which is then mapped to
a structure, which in turn is mapped to a physical realization. We can view the process
of transforming a behavior to a physical realization as a successive refinement process.
Optimization of design metrics such as area, performance, and power are the goals of
each of these refinement steps. The design process may spiral from the behavioral axis to
structural axis to physical design axis in multiple stepwise refinement steps.
We introduce the notion of the X-chart, which is illustrated in Figure 1.4. The X-
chart representation has four axes: (a) Behavior, (b) Logical Architecture, (c) Physical
Architecture and (d) Software Data Layout. The logical memory architecture (LMA)
defines the embedded cache size, cache associativity, cache block size, size of the scratch
pad memory, number of memory banks, and the number of ports. The physical memory
architecture (PMA) is an actual realization of an LMA using the memory library com-
ponents provided by the semiconductor vendor. The fourth dimension, namely Software
Data Layout, is necessary for capturing the process of embedded system design. We have
identified several steps in the embedded system design flow and marked them with circled
numbers. Table 1.1 explains the individual steps in the X-chart representation.
The design of an embedded system begins with a behavioral description (Point (1)
in Figure 1.4, which is shown on the behavioral axis). Today, there are many languages
available to capture the system behavior, e.g., System Verilog [5], System C [4], and so
on. Hardware-software partitioning is performed to identify which functionalities of the
description are best performed in hardware and which are best implemented in software.
Hardware implementation is cost-intensive, but improves the performance.
We show point (2) on the LMA axis, since hardware-software partitioning adds a considerable amount of detail needed to decide the LMA parameters. The next step is to select
hardware and software IP blocks. Depending on the time schedule (for designing the
embedded system) and the cost constraint, the designer may wish to use readily available
IP blocks from a vendor or implement a custom version of the IP. The target platform
is then defined to implement the embedded system. As mentioned earlier, a platform in-
cludes one or more processors, memory, and hardware accelerators for specific functions.
Platforms also come with software tools such as compilers and simulators, so that the
development cycle can be accelerated. In other words, one does not need to wait for the
hardware implementation to complete before trying out the software. We show point (4)
on the software data layout axis, since the selection of a platform defines many aspects
of software implementation. Software partitioning is now performed to decide which soft-
ware IP blocks are executed on which processor. This completes one spiral cycle in the
design life cycle of the embedded system. To recapitulate, the following components are
defined at the end of the first cycle: (a) the platform on which the embedded system
will be built, (b) the hardware and software IP blocks that are selected for the target
application, (c) assignment of software IP blocks to target processors where the software
will be executed. We show point (5) on the behavioral axis, since the next spiral cycle
will begin from here.
The next step is to define the logical memory architecture for the memory subsystem.
Guided by considerations such as cost, performance, and power, the designer must decide
basic architectural parameters of the memory sub-system, such as whether or not to
provide cache memory, how many memory banks are provided, whether or not dual-
ported memories are necessary for guaranteeing performance, etc. The next step is to
perform design space exploration in the logical space. Each logical memory architecture
is also characterized by the selection of values for parameters such as cache size, cache
associativity, cache block size, etc. There is often a cost/performance tradeoff between two
solutions in the architectural space. Hence the designer must consider different Pareto-
optimal solutions that exhibit cost/performance tradeoff. This results in point (6) in
Figure 1.4.
Figure 1.4: Application Specific SoC Design Flow Illustration with X-chart
A logical memory architecture must be translated into a physical implementation by
selecting components from the semiconductor vendor’s memory library. There are multiple
realizations, i.e., physical memory architectures (PMA) for the same LMA. This involves
choosing the appropriate modules based on the process technology selected in step (7),
and the corresponding semiconductor vendor memory library. These represent tradeoffs in terms of power consumed and VLSI area. This leads to point (7) in Figure 1.4. The
mapping of an LMA to a PMA is similar to the technology mapping step in logic synthesis
[53]. Data Layout (DL) is the subsequent step in the design life cycle. During this step,
the placement of data variables is determined, considering every possible implementation
Table 1.1: Explanation of X-chart Steps
of the physical memory architecture. Once again, there are multiple solutions for data
layout for a given PMA. These solutions may exhibit tradeoffs in power, performance,
and area.
In this thesis, we use the phrase Physical Memory Architecture Exploration (PMAE)
to refer to the search for Pareto-optimal LMA/PMA/DL solutions. We capture this in
the form of an equation that follows.
PMAE = Logical Memory Architecture Exploration + Memory Allocation Exploration + Data Layout Exploration   (1.1)
In this thesis, the focus is on memory sub-system optimization, constituted by steps
(5) to (9) in Figure 1.4. The size of the solution space increases manifold during each
step of the memory exploration. If N1 optimal solutions (logical memory architectures)
are identified during memory sub-system definition, memory allocation must be explored
for each one of them, which can potentially result in N1 × N2 solutions during memory
allocation exploration. Similarly, data layout must be performed for each of the N1 × N2 solutions from the memory allocation exploration step, and we may in general obtain N1 × N2 × N3 Pareto-optimal points in the PMAE solution space. As mentioned earlier, this problem can result in exploring a combinatorially exploding design space.
1.6 Contributions
First, we propose methods for data layout optimization, assuming a fixed memory archi-
tecture for a DSP-based embedded system architecture. Data layout is a critical compo-
nent in the embedded design cycle and decides the final configuration of the embedded
system. Data layout happens at the final stage in the life cycle of an embedded system, as
illustrated in the X-chart of Figure 1.4. Data layout forms the foundation for memory sub-
system optimization. Hence, we first formulate data section layout as an Integer Linear
Programming (ILP) problem. The proposed ILP formulation can handle: (i) partitioning
of data between on-chip and off-chip memory, (ii) placing simultaneously accessed data variables (parallel conflicts) in different on-chip memory banks, (iii) placing data variables that are accessed multiple times in the same cycle (self conflicts) in dual-access RAMs, (iv) overlay of data sections with non-overlapping lifetimes, and (v) swapping of data sections from/to off-chip memory.
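The cost that such a formulation minimizes can be illustrated with a small conflict model (the bank names, conflict pairs and one-stall penalty below are our own simplifications): a parallel conflict is penalized when two variables accessed in the same cycle sit in the same single-access bank, and a self conflict when a variable accessed twice per cycle sits in single-access memory. This sketch only evaluates one given layout, whereas the ILP searches over all layouts:

```python
# Toy conflict-cost model for a given data layout (illustrative only).
# bank_of maps each variable to a bank; DARAM banks allow 2 accesses/cycle.
daram_banks = {"B0"}                          # assumed: bank B0 is dual-access
bank_of = {"x": "B0", "y": "B1", "z": "B1"}   # a candidate layout

# (a, b) means a and b are accessed in the same cycle; (a, a) models a
# self conflict, i.e., two accesses to the same variable in one cycle.
conflicts = [("y", "z"), ("x", "x"), ("y", "y")]

def layout_cost(bank_of, conflicts):
    stalls = 0
    for a, b in conflicts:
        same_bank = bank_of[a] == bank_of[b]
        if same_bank and bank_of[a] not in daram_banks:
            stalls += 1   # the second access must wait one cycle
    return stalls

print(layout_cost(bank_of, conflicts))  # 2: both B1 conflicts cost a stall
```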
An important contribution of this work is the development of a simple unified ILP
formulation to handle all the above mentioned optimizations. The ILP based approach
is very effective for many moderately complex applications and delivers optimal results.
However, as the application complexity increases, the execution time of the ILP method increases drastically, making it unsuitable for large applications and for situations (such as memory architecture exploration) where the data layout problem needs to be solved repeatedly.
Hence we looked at developing faster methods to solve this problem. We propose a
heuristic algorithm that maps the data sections to the given memory architecture and
reduces the number of memory access conflicts resulting from both self conflicts and
parallel conflicts. Finally, we also formulate the same problem in Genetic Algorithm (GA)
and compare the results of the heuristic with GA. We find that the heuristic algorithm
performs within 5% of GA’s results with GA performing better. However, the heuristic
algorithm’s run-time is an order faster than GA’s run-time making it suitable to be used
for memory architecture exploration.
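The heuristic itself is presented in Chapter 3; as a rough illustration of conflict-aware placement, the sketch below greedily assigns data sections to banks so that heavily conflicting pairs land in different banks. The conflict counts and section names are hypothetical, and this is a simplification of the actual algorithm, not a reproduction of it.

```python
# Parallel-conflict counts between section pairs: conflicts[(a, b)] is how
# often a and b are accessed in the same cycle (hypothetical profile data).
conflicts = {("x", "y"): 500, ("x", "z"): 200, ("y", "z"): 50}
sections = ["x", "y", "z", "w"]
NUM_BANKS = 2

def conflict(a, b):
    return conflicts.get((a, b)) or conflicts.get((b, a)) or 0

def greedy_bank_assign(sections, num_banks):
    """Greedy heuristic: place sections in decreasing order of total
    conflict weight, each into the bank where it conflicts least with
    the sections already placed there."""
    order = sorted(sections,
                   key=lambda s: -sum(conflict(s, t) for t in sections))
    banks = [[] for _ in range(num_banks)]
    for s in order:
        cost = [sum(conflict(s, t) for t in bank) for bank in banks]
        banks[cost.index(min(cost))].append(s)
    return banks

banks = greedy_bank_assign(sections, NUM_BANKS)
print(banks)
```

Here the heaviest conflict (x with y) is resolved by separation, and only the cheapest conflict (y with z) remains within a bank.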
Next, we address logical memory architecture exploration for DSP-based embedded
systems (step (5) to (7) in the X-chart of Figure 1.4). The input is a set of high-level
memory parameters such as the number of memory banks, size of each memory bank,
number of ports etc., that define the memory sub-system. The goal of the exploration is
to find an optimal on-chip memory organization that can run the given applications with
minimum number of memory stalls. When a logical memory architecture (LMA) is generated, it must be evaluated for cost (in terms of VLSI area) and performance, but both of these depend on the data layout. Hence, to evaluate a memory architecture properly, we must first generate an efficient data layout; for this we use our fast heuristic method. We have implemented
the memory architecture exploration problem as a two-level hierarchical search, with
architectural exploration at the outer level and data-layout exploration at the inner level.
A multi-objective GA and a Simulated Annealing algorithm (SA) are used as alternate
search mechanisms for the architectural exploration problem. As the memory architecture
exploration framework considers both performance and cost (VLSI area) objectives, we use the Pareto-optimality criterion proposed in [25] to identify design points that are interesting with respect to one objective or the other.
The proposed memory exploration framework is fully automatic and flexible. The
framework is also scalable, and additional objectives like power consumption can be added
easily. We have used four different applications from multimedia and communication
domains for our experiments and found 100-200 Pareto-optimal design choices (memory
architectures) for each of the applications.
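The two-level structure can be sketched as follows: an outer loop proposes architectural parameters, and an inner evaluation (standing in for the data layout heuristic) scores each candidate. The parameter ranges and cost models below are toy assumptions, and random sampling stands in for the multi-objective GA/SA used in this work.

```python
import random

random.seed(1)

# Candidate logical memory parameters (assumed ranges, not the thesis's).
BANK_COUNTS = [1, 2, 4]
BANK_SIZES_KB = [2, 4, 8]

def area_cost(num_banks, bank_kb):
    # Toy area model: more and larger banks cost more silicon.
    return num_banks * bank_kb * 1.0 + num_banks * 0.5

def layout_stalls(num_banks, bank_kb):
    # Stand-in for the inner data-layout heuristic: more banks and
    # capacity mean fewer conflict/capacity stalls in this toy model.
    return 1000.0 / (num_banks * bank_kb)

def explore(samples=20):
    """Outer loop samples architectures; the inner call evaluates each
    via a data layout, yielding (area, stalls) design points."""
    points = set()
    for _ in range(samples):
        nb = random.choice(BANK_COUNTS)
        kb = random.choice(BANK_SIZES_KB)
        points.add((area_cost(nb, kb), layout_stalls(nb, kb)))
    return sorted(points)

print(explore()[:3])
```

Each returned pair is one design point; filtering this set for Pareto optimality (Section 2.5) yields the interesting architectures.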
Next, we explore the data layout design space for a given physical memory architecture
in order to optimize the performance and power consumption of the memory subsystem.
Note that data layout exploration forms steps (8) to (9) in the X-chart representation.
We propose MODLEX, a Multi Objective Data Layout EXploration framework based
on Genetic Algorithm that explores the data layout design space for a given logical and
physical memory architecture and obtains a list of Pareto-optimal data layout solutions
from performance and power perspectives. Most of the existing work in the literature
assumes that performance and power are non-conflicting objectives with respect to data
layout. However, we show that a significant trade-off (up to 70%) is possible between power and performance.
Our next step is physical memory architecture exploration (steps (5) to (8) in Figure 1.4).
We propose two different methods for physical memory exploration. The first approach is
an extension of the Logical Memory Architectural Exploration (LMAE) method described
in Chapter 4 and represented in the X-chart by steps (5) to (6). Physical memory exploration is performed by taking the output of LMAE and, for each of the Pareto-optimal logical memory architectures, performing a memory allocation exploration (steps (6) to (7)) with the objective of optimizing power and area in the physical memory space. Note that the data
layout is fixed at the logical memory exploration stage itself and hence the performance
does not change at this step. The memory allocation exploration is formulated as a multi-
objective Genetic search to explore the design space with power and area as objectives.
We refer to this approach as LME2PME.
The second approach is a direct and integrated approach for Physical Memory Ex-
ploration, which we refer to as DirPME. This approach corresponds to a direct move
from point 5 to point 8 in Figure 1.4. In this approach, we integrate three critical com-
ponents together: (i) Logical Memory Architecture Exploration, (ii) Memory Allocation
Exploration (iii) Data layout exploration. The core engine of the memory architecture
exploration framework is formulated as a Multi-objective Non-Dominated Sorting Genetic
Algorithm (NSGA) [25]. For the data layout problem, which needs to be solved for thousands of memory architectures, we use our fast and efficient heuristic data layout method.
Our integrated memory architecture exploration framework searches the design space by
exploring thousands of memory architectures and lists 200-300 Pareto-optimal design solutions that are interesting from an area, power, and performance viewpoint.
Next, we address the memory architecture exploration problem for hybrid memory ar-
chitectures that have a combination of SPRAM and cache. For such a hybrid architecture,
a critical step is to partition the data between on-chip SPRAM and external RAM. Data
partitioning aims at improving the overall memory sub-system performance by placing
data in SPRAM that has the following characteristics: (a) high access frequency, (b) a life time that overlaps with those of many other data, and (c) poor spatial access characteristics. Placing all data that exhibit these characteristics in SPRAM reduces the number of potentially conflicting data in the cache, which in turn reduces cache misses and improves overall memory sub-system performance.
But typically the SPRAM size is small and it is not possible to accommodate all the
data identified for SPRAM placement. Hence, even after data partitioning, there will be
a significant number of potentially conflicting data sections that need to be placed in external RAM. These data need to be placed such that the conflict misses caused between them are reduced. Cache-conscious data layout addresses this problem and
aims at placing data in external RAM (off-chip RAM) with the objective to reduce cache
misses. This is achieved by an efficient data layout heuristic that is independent of in-
struction caches, optimizes run-time and keeps the off-chip memory address space usage
under check. We extend the above approach and perform hybrid memory architecture exploration with the objective of optimizing run-time performance, power consumption and area.
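A minimal sketch of the partitioning idea follows, assuming a made-up scoring function that combines the three characteristics listed above (access frequency, lifetime overlap, poor spatial locality). All names and numbers are hypothetical; the actual heuristic is developed in Chapter 7.

```python
# Each entry: (name, size, access_freq, lifetime_overlaps, spatial_locality)
# spatial_locality in [0, 1]: low values mean poor locality (hypothetical).
data = [("fir_state", 2048, 80000, 5, 0.2),
        ("frame_buf", 8192, 30000, 2, 0.9),
        ("twiddle", 1024, 60000, 6, 0.1),
        ("log_buf", 4096, 1000, 0, 0.8)]

SPRAM_SIZE = 4096

def spram_candidates(data, capacity):
    """Rank data by a score favoring high access frequency, many lifetime
    overlaps, and poor spatial locality, then fill SPRAM greedily."""
    def score(d):
        _, size, freq, overlaps, locality = d
        return (freq * (1 + overlaps) * (1 - locality)) / size
    chosen, used = [], 0
    for d in sorted(data, key=score, reverse=True):
        if used + d[1] <= capacity:
            chosen.append(d[0])
            used += d[1]
    return chosen

print(spram_candidates(data, SPRAM_SIZE))
```

The frequently accessed, poor-locality, heavily overlapping data win the SPRAM; the large, cache-friendly buffer is left for the cached external RAM.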
The salient features of our work are as follows.
• First, we provide a unified framework for logical memory exploration, memory allocation exploration, and data layout.
• Our work addresses power, performance, and area optimization in an integrated framework.
• Our work provides a memory architecture exploration framework for a hybrid memory architecture involving on-chip SPRAM and cache.
• Our work does not rely on source-code optimization for power and performance improvements; hence it is suitable for platform-based/IP-based system design.
1.7 Thesis Overview
The rest of the thesis is organized as follows. In the following chapter, we provide the
background material for the thesis. We begin by explaining the memory architecture of
a DSP and an MCU. We summarize the software optimizations used in the literature
to improve memory access efficiency. We explain cache-based embedded SoCs and their challenges with respect to predictability. Finally, we introduce the concepts of Genetic Algorithms (GA) for optimization, since a GA is used in our optimization framework in later chapters.
In Chapter 3, we propose different methods to address the data layout problem for on-chip SPRAM based memory architectures. First, we propose an Integer Linear Programming (ILP) based approach. Further, we also propose a fast and efficient heuristic for the data
layout problem. Finally, we formulate the data layout problem in Genetic Algorithm
(GA).
In Chapter 4, we present a multi-objective memory architecture exploration framework
to search the memory design space for the on-chip memory architecture with performance
and memory cost as two objectives. We address the memory architecture exploration
problem at the logical level.
The multi-objective Data Layout Exploration problem is addressed in Chapter 5. Here,
the data layout design space is explored for a given logical memory architecture and
application with respect to performance and power.
In Chapter 6, we address the memory architecture exploration problem at physical
memory level. In this chapter we propose two different approaches for physical memory architecture exploration.
An SPRAM-cache based hybrid architecture is considered in Chapter 7. In this chapter, we propose an efficient heuristic to partition data between on-chip SPRAM and cache. Further, we propose a cache-conscious data layout. The memory design space is explored by
using an exhaustive search based approach.
Finally in Chapter 8, we summarize our work and outline the future work.
In summary, Figure 1.5 maps each chapter of this thesis to the steps of the X-chart. As can be seen in the figure, this work addresses memory subsystem exploration and optimization at the architectural level, taking both hardware design and software (application development) constraints into consideration.
Figure 1.5: Mapping Chapters to X-chart Steps
Chapter 2
Background
In this chapter we provide the background information that is useful for understanding the rest of the thesis. The following section explains the on-chip memory architecture of Digital Signal Processors (DSPs) and Microcontrollers (MCUs). Section 2.2 presents the software optimizations used in embedded applications that are targeted at using on-chip memory efficiently. Section 2.3 describes cache based on-chip memory architectures and motivates the need for cache-SPRAM based hybrid architectures for embedded SoCs. In Section 2.4, an overview of Genetic Algorithms is presented. Finally, in Section 2.5, the importance of multi-objective multiple design solutions for platform based design is explained.
2.1 On-chip Memory Architecture of Embedded Processors
2.1.1 DSP On-chip SPRAM Architecture
DSP processor based embedded systems have an on-chip memory which typically has a
single cycle access time [49]. The on-chip memory, also referred to as scratch pad memory,
is mapped into an address space disjoint from the off-chip memory but connected to
the same address and data buses.¹ Typically, the scratch-pad memory is organized into multiple memory banks to facilitate multiple simultaneous data accesses. DSP processors typically have two or more address generation units and multiple on-chip buses to facilitate
multiple memory accesses.
Figure 2.1: Example DSP Memory Map
Further, each on-chip memory bank can be organized either as a single-access RAM
(SARAM) or as a dual-access RAM (DARAM), to provide single or dual accesses to
the same memory bank in a single cycle. For example, the Texas Instruments TMS320C54X digital signal processor has two data read buses and one data write bus [75], and the Texas Instruments TMS320C55X processor has three data read buses and two data write buses, since concurrent accesses to the same array are common in DSP applications [76]. Figure 2.1 presents the memory map of C55X DSPs, where multiple SARAM and DARAM memory banks form part of the memory map, and MMR denotes the memory-mapped registers, which typically contain control registers, status registers and stack pointers. The DARAM and SARAM regions are organized as multiple memory banks to enable two concurrent accesses.
¹We use the terms "scratch pad memory", "on-chip memory" and "internal memory" interchangeably. Similarly, "off-chip memory" and "external memory" are used interchangeably.
2.1.2 Microcontroller Memory Architecture
Microcontrollers (MCUs) are designed to execute control-type applications efficiently. The applications that run on microcontrollers are not very data intensive and hence do not require DARAM. But the real-time constraints of embedded applications and the need to run the applications in a time-bound manner require on-chip SPRAM. Similar to DSP memory architectures, MCU processors also have on-chip SPRAM. But unlike a DSP's on-chip SPRAM, the MCU's on-chip RAM is not organized as multiple memory banks, because MCU applications typically do not perform more than one memory access per clock cycle. The MCU's on-chip RAM may nevertheless be constructed with multiple physical memory modules due to practical constraints: (a) smaller memory modules are faster and more power efficient and (b) it is not practical to construct one large memory module and still meet the access latency constraint. For example, to
construct 192KB of on-chip SPRAM, hardware designers typically use 6×32KB memory modules. However, the 6×32KB organization is normally not exposed to the software application developers, and hence from a software development perspective it is still one monolithic 192KB memory. This distinction between what is exposed to the application programmer, referred to as the logical memory architecture, and how the same is realized using physical memory modules (banks), referred to as the physical memory architecture, is important for both DSPs and MCUs.
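The logical-to-physical relationship in the 192KB example can be made concrete. Assuming the flat logical address space is simply striped across the six modules in address order (one plausible realization), a logical byte address maps to a module and offset as follows:

```python
MODULE_KB = 32
NUM_MODULES = 6  # 6 x 32KB modules realize a 192KB logical SPRAM

def physical_location(addr):
    """Map a logical SPRAM byte address to (module index, offset) in the
    physical realization.  The application sees one flat 192KB space."""
    module_bytes = MODULE_KB * 1024
    assert 0 <= addr < NUM_MODULES * module_bytes
    return addr // module_bytes, addr % module_bytes

print(physical_location(100 * 1024))  # a logical address 100KB into the flat space
```

The software never sees the `(module, offset)` pair; only the hardware designer (and the physical memory exploration of Chapter 6) cares about it.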
Further, the MCU's on-chip SPRAM can be realized using non-uniform sized memory banks to optimize overall system power consumption. For example, it is area efficient to use large memory modules to construct the on-chip SPRAM, but large memories consume more power per read/write access compared to smaller memories: a 2KB memory module typically consumes only half the power per read/write access of an 8KB memory module, while 4×2KB modules consume more area than one 8KB module. Hence there is an area-power trade-off in selecting memory modules to construct on-chip SPRAM. Non-uniform bank sized memory architectures aim at balancing the area-power objectives.
In this thesis our focus is on memory architecture optimization for DSPs. This is because the memory architecture of a DSP is more complex than that of a microcontroller (MCU), for the following reasons: (a) DSP applications are more data dominated than the control-dominated software executed on an MCU. Memory bandwidth
requirements for DSP applications range from 2 to 3 memory accesses per processor clock
cycle. For an MCU, this figure is, at best, one memory access per cycle. (b) It is critical in DSP applications to extract maximum performance from the memory subsystem in
order to meet the real-time constraints of the embedded application. As a consequence,
the DSP software for critical kernels is developed mostly as hand optimized assembly
code. In contrast, the software for MCU is typically developed in high-level languages.
The memory architecture for a DSP is unique since the DSP has multiple on-chip buses
and multiple address generation units to service higher bandwidth needs. The on-chip
memory of embedded processors can include (a) only Level-1 cache (L1-cache) (e.g., [1]),
(b) only scratch-pad RAM (SPRAM) (e.g., [75, 76]), or (c) a combination of L1-cache and
SPRAM (e.g., [2, 77]).
2.2 Software Optimizations
Embedded applications have a built-in hierarchy. An application is composed of several
modules, where each module consists of one or more code and data sections [74]. Each
data section consists of a set of data variables, and/or data arrays, grouped purely for the
sake of convenience. But it is also typical that each data array will be named as a separate
section. Software developers spend considerable effort and time to achieve a careful layout
of code and data sections to get maximum performance from the scratch pad memory [74].
The applicability of these optimizations and the methods used to perform data layout depend on the memory architecture. There are different software optimizations necessary for on-chip
memory architectures of DSP and MCU processors, which are discussed in the following
subsections.
2.2.1 DSP Software Optimizations
To take advantage of the multiple on-chip memory buses provided by the underlying pro-
cessor architecture, software application developers must carefully partition the data into
several independent sections. A data section typically holds an array or a set of program
data structures and is placed contiguously in a memory bank. The data structures that
are used in the same instruction cycle are said to be mutually conflicting and are ideally
assigned to different sections so that they can be placed in different memory banks. As-
signing data structures to separate sections increases the number of placement decisions
drastically.
Several software optimization techniques for improving the performance have been
proposed in the literature [10, 14, 37, 40, 44, 53, 58, 74, 79], including:
• Placing frequently accessed data variables to on-chip SPRAM and placing less fre-
quently accessed data variables in off-chip RAM [10, 53].
• Partitioning data arrays that are accessed simultaneously in the same processor cycle
into different on-chip memory banks. This way multiple data can be simultaneously
accessed in the same cycle without incurring any additional memory stalls [40, 58].
• Mapping a data array that must support multiple simultaneous accesses to DARAM. This avoids additional memory stalls for two simultaneous accesses [44].
• Overlay of data structures, typically arrays, to share the same on-chip memory
space. These arrays are referred to as scratch buffers [74]. The life time of these buffers is limited to a software module. Hence scratch buffers corresponding to
different modules, which are not live simultaneously, can share the same on-chip
memory space.
• Swapping critical code and data sections from off-chip memory to on-chip memory
before the execution of the appropriate code segment. This facilitates efficient access to the code/data currently being accessed. The benefits of swapping (on-chip access and reduced memory stalls) should more than compensate for the cost of swapping [37, 79].
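The overlay optimization above can be sketched as an interval problem: buffers whose lifetimes never overlap can share one on-chip region, so the region only needs to hold the largest set of simultaneously live buffers. The buffer names, sizes, and lifetime intervals below are hypothetical.

```python
# Scratch buffers with (size, live_interval) -- hypothetical lifetimes
# expressed as (start, end) positions in a module execution order.
buffers = {"fft_tmp": (2048, (0, 3)), "viterbi_tmp": (1024, (4, 7)),
           "eq_tmp": (2048, (2, 5))}

def overlay_size(buffers):
    """Return the peak concurrently-live size: the on-chip space needed
    when all these buffers share one overlaid region."""
    events = []
    for size, (start, end) in buffers.values():
        events.append((start, size))        # buffer becomes live
        events.append((end + 1, -size))     # buffer dies after `end`
    peak = live = 0
    for _, delta in sorted(events):         # deaths sort before births at a tick
        live += delta
        peak = max(peak, live)
    return peak

print(overlay_size(buffers), sum(s for s, _ in buffers.values()))
```

Here the overlaid region needs 4096 bytes instead of the 5120 bytes a dedicated allocation would use, because `fft_tmp` and `viterbi_tmp` are never live together.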
Except for the swapping technique, which works on both code and data, all the other
techniques concentrate only on data. Managing data is very important because most of
the embedded applications are data dominated [19].
Towards achieving this goal, critical code and data — which are accessed frequently —
are identified by performing extensive simulation and profiling of the application. The decision to place a data structure in the on-chip SRAM is taken after analyzing the access frequency of the variable in the application.
The ideal case is where all the critical code and data sections can be placed in the on-
chip memory. While this can result in very high performance, in terms of fewer memory
stall cycles, it is also prohibitively expensive to support such a large on-chip SRAM.
Hence to achieve a good performance/cost ratio, a careful data layout for the memory
architecture is mandatory.
Taking the above optimizations into consideration, a code and data section layout can
be defined as a mapping which specifies where (i.e., in which memory type) the various
code and data sections reside, the memory bank(s) on which the sections reside, the type
of memory access (single or dual access) supported in the memory bank, whether or not
certain code (or data) sections are overlayed, and whether or not certain code (or data)
sections are swapped.
2.2.2 MCU Software Optimizations
Typically, embedded applications running on the MCU are control-oriented and not very
computation intensive. The primary objective is to use the on-chip SPRAM efficiently.
Towards this, the application is profiled to get the access frequency of all the data variables.
Frequently accessed variables are placed in on-chip SPRAM and less frequently accessed
variables are placed in off-chip RAM.
With an objective to optimize power consumption, a non-uniform bank size based on-chip SPRAM architecture is used in [14]. The key idea is that smaller banks are used to accommodate the most frequently accessed variables; this placement optimizes the system power. For example, let a and b be two data variables, each 1KB in size, accessed 100000 times and 20000 times respectively. For an on-chip SPRAM of 16KB organized as 4×2KB and 1×8KB, placing a in one of the 2KB banks and placing b in the 8KB bank is more power optimal than the other way around.
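The arithmetic behind this example can be checked directly, assuming (as stated above) that the 2KB bank costs about half the energy per access of the 8KB bank; the 1-unit and 2-unit figures below are that assumption made concrete, not measured values.

```python
E_2KB, E_8KB = 1.0, 2.0   # assumed energy units per access (2KB ~ half of 8KB)
ACCESSES_A, ACCESSES_B = 100_000, 20_000

# Placement 1: frequently accessed `a` in a 2KB bank, `b` in the 8KB bank.
energy_good = ACCESSES_A * E_2KB + ACCESSES_B * E_8KB
# Placement 2: the other way around.
energy_bad = ACCESSES_A * E_8KB + ACCESSES_B * E_2KB

print(energy_good, energy_bad)
```

Placement 1 spends 140000 units against 220000 for placement 2, a 36% energy saving from bank assignment alone.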
2.3 Cache Based Embedded SOC
All programs exhibit the property of locality of reference [68] and the cache memories
exploit this property of the programs to give improved performance. Programs exhibit
two types of locality: temporal and spatial. Temporal locality indicates that a recently accessed memory location is likely to be accessed again, while spatial locality implies that a recently accessed memory location's neighboring locations are likely to be accessed.
In cache based architectures, data is placed in an off-chip RAM and copied at run-time
to cache by a hardware cache controller. Cache controllers increase the silicon area, but
eliminate the requirement of data placement and management and the associated run-
time overhead. The mapping of data from off-chip RAM to L1-cache is dictated by the
cache associativity scheme and can create potential side effects like thrashing. Therefore,
a careful analysis of data access characteristics and understanding of temporal access
pattern of the data structures is required to improve the cache performance.
From a power, performance and area perspective, direct mapped caches are preferred
over set-associative caches. However, direct mapped caches incur much more off-chip
memory traffic [36], which, when not handled properly, can lead to very high power
consumption and lower performance. In [36], the traffic inefficiency of direct mapped
caches is evaluated for different embedded and multimedia applications from Mediabench
[43]. The traffic (data movements from off-chip RAM to cache and vice versa) inefficiency
is a factor of 10 or more even for large cache sizes, and is mainly attributed to conflict
misses. However, for application specific systems, the code is known a priori and an
optimal cache-conscious data layout will be able to reduce the number of conflict misses
and improve the performance and power consumption by reducing the off-chip memory
traffic.
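The effect of a cache-conscious layout on conflict misses can be demonstrated with a small direct-mapped cache simulation. The cache geometry, array sizes, and access pattern below are assumptions chosen to make the conflict visible, not figures from [36]: two arrays whose base addresses differ by exactly the cache size thrash every set, while offsetting one array by a single line removes the conflicts.

```python
LINE = 32          # bytes per cache line (assumed)
SETS = 64          # direct-mapped: one line per set -> 2KB cache

def misses(trace, line=LINE, sets=SETS):
    """Count the misses of a direct-mapped cache on a byte-address trace."""
    tags = [None] * sets
    count = 0
    for addr in trace:
        block = addr // line
        idx, tag = block % sets, block // sets
        if tags[idx] != tag:   # miss: fetch the block, evict the old tag
            tags[idx] = tag
            count += 1
    return count

cache_bytes = LINE * SETS          # 2048
# Bases exactly one cache size apart: every interleaved pair of word
# accesses maps to the same set (worst-case layout)...
bad = [a for i in range(256) for a in (i * 4, cache_bytes + i * 4)]
# ...while padding the second array by one line breaks the alignment.
good = [a for i in range(256) for a in (i * 4, cache_bytes + LINE + i * 4)]
print(misses(bad), misses(good))
```

In this toy trace the aligned layout misses on all 512 accesses, while the padded layout misses only on the 64 cold/compulsory fills: the same code, an 8× traffic difference, purely from data placement.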
2.3.1 Cache-SPRAM Based Hybrid On-chip Memory Architecture
Designers of real-time embedded memories have typically preferred scratch-pad memories
(SPRAM) over data caches since the latter lead to unpredictability in access latencies as
cache hits and misses result in different access times. For small to medium applications, SPRAM gives acceptable performance when the memory is used efficiently through optimal data layout. However, medium to large applications require a large SPRAM to meet the real-time performance criterion, because SPRAM is dedicated to specific data variables and is not reused or shared among different data variables. There are dynamic data layout approaches that
or shared among different data-variables. There are dynamic data layout approaches that
aim at sharing the SPRAM by bringing the data-variables from off-chip RAM to on-chip
RAM at run-time [37, 79]. However, these approaches are very complex and make the software very difficult to maintain and debug. Data movements in the dynamic layout approach may also lead to code size and run-time overheads. Hence architectures with only SPRAM as on-chip memory will be highly area inefficient for large applications. On the other hand, cache based architectures use the on-chip memory efficiently by sharing the cache sets among data variables at run-time. However, cache based architectures result in unpredictable execution times. It is difficult to estimate the worst-case guaranteed
run-time performance in a cache based system, which is a requirement for all embedded
systems. Several approaches have been proposed to predict the worst case execution time
in such systems [78, 81]. Hybrid memory architectures have become popular for real-time
embedded systems, since caches offer sharing of on-chip memory space and SPRAM offers
predictability. Hence, many embedded SoCs use a mix of SPRAM and cache memories, as
shown in Figure 2.2.
2.4 Genetic Algorithms - An Overview
Genetic Algorithms (GA) [30] belong to the class of stochastic search methods [69]. Other
stochastic search methods include simulated annealing [57], threshold acceptance [27], and
Figure 2.2: Cache-SPRAM Based On-Chip Memory Architecture
some forms of branch and bound [24]. Most stochastic search methods operate on a single solution to the problem at hand, whereas genetic algorithms operate on a set of solutions, which leads to faster convergence.
To use a genetic algorithm, the problem at hand needs to be encoded as an object. In GA terminology, the encoded object is called a chromosome. A population consists of a set of such chromosomes. GA combines "randomness" and "survival of the fittest" to perform an effective search in the solution space [30].
Figure 2.3 explains the basic flow of a GA. To start with, the solution to the problem at hand needs to be modeled as a chromosome; a better chromosome means a better solution. The next step is to create a set of P chromosomes, referred to as the population, initialized by the initialization step in Figure 2.4. The objective of the GA is to keep operating on the chromosomes in the current population to generate new chromosomes and select the P fittest chromosomes. The GA uses operators like selection, crossover and mutation for this purpose. Observe that there are two nested loops in the GA, as shown in Figure 2.4. The outer loop corresponds to the evolution of different generations and the inner
loop constitutes the GA operations that generate a set of new chromosomes within a generation.
The inner loop starts with the selection operation, which picks two of the best individuals for mating. Some of the more commonly used selection methods [30] are (i) roulette wheel selection, (ii) tournament selection and (iii) rank selection. The probability of selecting a chromosome in roulette wheel selection is proportional to the fitness of the chromosome. For tournament selection, a set of chromosomes is selected based on roulette wheel selection and then the top two chromosomes are picked among the selected chromosomes. Rank selection always picks the best two chromosomes based on the fitness function. We have used roulette wheel selection in our work.
Figure 2.3: Genetic Algorithm Flow
A crossover operation is performed on the selected pair of chromosomes, called the parents. The crossover operator typically takes part of each parent and generates two children. Thus, the child chromosomes are expected to have a combination of characteristics from both parents. Since the parents are among the best chromosomes in the current population, the children are expected to be better than the parents, by evolutionary reasoning. Typically a 3-point crossover is performed, as illustrated in Figure 2.3. After the crossover operation, the mutation operation is performed with a certain probability. The mutation operation randomly changes certain elements (flips a bit) of the chromosome, introducing a certain amount of randomness into the search; it can help the search find solutions that crossover alone might not encounter. For each of the new chromosomes, objective functions are computed.
There can be more than one objective. A fitness value is assigned based on the set of objectives; the fitness function represents how good a chromosome is (in other words, how good a solution is). This set of operations is repeated (the inner loop) until M new chromosomes are generated, and for each of the new M chromosomes the objective functions and fitness values are computed.
The last step in a generation (outer loop) is the annihilation step, which represents the "survival of the fittest" concept. At the end of the inner loop there is a total of P + M chromosomes, where P are the parents and M are the newly generated children. Out of the P + M chromosomes, the top P chromosomes with respect to the fitness function are selected and passed on to the next generation; the remaining chromosomes are discarded. The outer loop is repeated for a given number of generations.
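The flow described above can be sketched end-to-end. This sketch uses a toy bit-counting fitness, single-point rather than 3-point crossover, and small parameter values, so it illustrates the loop structure (selection, crossover, mutation, annihilation) rather than the actual GA used in the later chapters.

```python
import random

random.seed(42)

GENES, POP, CHILDREN, GENERATIONS = 16, 10, 10, 30

def fitness(chrom):
    # Toy objective (count of 1-bits); stands in for the application-
    # specific objective and fitness computation.
    return sum(chrom)

def roulette(pop):
    """Roulette wheel selection: pick proportionally to fitness."""
    total = sum(fitness(c) for c in pop)
    r, acc = random.uniform(0, total), 0.0
    for c in pop:
        acc += fitness(c)
        if acc >= r:
            return c
    return pop[-1]

def crossover(p1, p2):
    # Single-point crossover for brevity (the text describes 3-point).
    cut = random.randrange(1, GENES)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(chrom, rate=0.05):
    return [g ^ 1 if random.random() < rate else g for g in chrom]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):                 # outer loop: generations
    children = []
    while len(children) < CHILDREN:          # inner loop: make M children
        c1, c2 = crossover(roulette(pop), roulette(pop))
        children += [mutate(c1), mutate(c2)]
    # Annihilation: keep the fittest P of the P + M chromosomes.
    pop = sorted(pop + children, key=fitness, reverse=True)[:POP]

print(fitness(pop[0]))
```

Because annihilation never discards the current best chromosome, the best fitness is monotonically non-decreasing across generations.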
2.5 Multi-objective Multiple Design Points
Platform based design is a way to address the complexity of embedded system design
under tight deadlines. It is common to build systems around the same computational
platform which includes a microprocessor or microcontroller for running the operating
system and a DSP for running media-related applications. The same platform will there-
fore need to cater to different application characteristics. The OMAP platform from
Texas Instruments comes in several flavors to address the market diversity. Similarly,
Texas Instruments offers two variants of the C55X DSP: the C5510 (with 320KB of SRAM and 64KB of DARAM) for high-end applications and the C5503 (with 64KB of SRAM and 64KB of DARAM) for mid-range applications. As a consequence, the platform designer is not just interested
in a single optimal design point but in a set of design points. This set of design points is termed the non-dominated set [30], as no design point in it is better than any other non-dominated point on all objective criteria. These non-dominated points form the Pareto-optimal set. The condition of Pareto optimality [30] is mathematically defined as follows. Let a vector x be partially less than y, written x <p y, when the following conditions hold:
(x <p y) ⟺ (∀i)(xi ≤ yi) ∧ (∃i)(xi < yi)
Using the partial relation <p, we say that if x <p y then x dominates y, or y is a dominated point. If the set of all dominated points is removed from the set of all points in the design space, we obtain the non-dominated set, i.e., the Pareto-optimal design points. In other words, the complement of the dominated set with respect to the design space gives the Pareto-optimal set.
Each non-dominated point is an optimal design point with a specific price-performance factor. Thus, platform based design requires that a set of non-dominated design points be computed automatically and in reasonable computation time.
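The dominance test and the filtering step can be written down directly from the definition above (all objectives minimized here); the design points are hypothetical (area, memory-stall) pairs, not results from this work.

```python
def dominates(x, y):
    """x partially-less-than y: x is no worse in every objective and
    strictly better in at least one (all objectives minimized)."""
    return (all(a <= b for a, b in zip(x, y))
            and any(a < b for a, b in zip(x, y)))

def pareto_front(points):
    """Remove every dominated point; the complement is the Pareto set."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (area, stalls) design points from an exploration run.
designs = [(10, 500), (12, 300), (15, 300), (20, 100), (11, 450)]
print(sorted(pareto_front(designs)))
```

Here (15, 300) is dropped because (12, 300) is no worse in both objectives and strictly better in area; the four survivors each trade area against stalls.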
Chapter 3
Data Layout for Embedded Applications
3.1 Introduction
As discussed in Chapter 1, embedded applications are highly performance critical, and the processor resources need to be utilized optimally to extract maximum performance. One of the most critical steps in the embedded application development flow is system integration, where all the software modules are integrated and mapped to a given target memory
architecture. This step has large performance implications depending on how the memory architecture is used. The memory architecture of embedded DSPs is heterogeneous and contains memories with different access times. For example, an embedded system may
contain on-chip and off-chip memory modules with different access times, single and dual
ported memory, and multiple memory banks to support many simultaneous accesses.
During system integration, the decision to map critical data onto faster memories and non-critical data onto slower memories is made. But it is not easy to classify the data into critical and non-critical for the following reasons: (a) typically 70 to 80% of the code and data is legacy and may not have a clear specification, (b) most of the code in embedded DSPs is developed in assembly and hence compiler based analysis is not possible [49], and (c) because of faster time-to-market constraints many of the software
modules are procured as IPs from software vendors.
The need for integrating multiple IP modules as part of embedded application de-
velopment was discussed in Section 1.3. Typically the IP modules are optimized on a
stand-alone embedded processor with a generic memory architecture. However, when
these IPs get integrated as part of the system, the memory architecture may be differ-
ent from the original target platform on which the performance of the IP module was
characterized. Hence, during system integration, the integrated module may not
perform at the level quoted in the specification given by the IP vendor. Since
the IPs/modules are independently optimized, the integrator is under pressure to deliver
the complete product with each component performing at the same level as it did independently.
This is a big challenge. When a module is optimized independently, its
developer has the whole SoC's resources (MIPS/memory) available to optimize the module.
The application integrator, however, has to consider all the modules, so the
system resources have to be shared among the different components. Hence the application
integrator needs to know each module's MIPS and memory requirements
unambiguously in order to allocate the shared resources to the most critical needs. Usually the
module’s memory requirements are given only at a high level. To be able to optimize
the whole application/system, the integrator needs detailed memory analysis at the
module level, such as which data buffers need to be placed in dual-ported memories and
which data buffers should not be placed in the same memory bank. This data is usually
not available. Further, the critical code is usually written in low-level assembly language
to meet real-time constraints and/or due to legacy reasons. In order to obtain good performance
and a reduction in memory stalls, the data buffers of the application need to be
placed carefully in the different types of memory; this is known as the data section layout
problem. Typically, data section layout is performed manually. For the above
reasons, application integration/optimization takes a significant amount of
time (approximately 1-2 man-months) analyzing the application and mapping software
modules to the custom memory architecture in order to obtain optimal cost and performance.
In summary there are two issues: (1) the time taken is significant, which is not acceptable
for current day time to market requirements, (2) quality of solution varies based on the
expertise.
The data layout optimization methods [10, 40, 44, 53, 58] vary significantly between applications
built for microcontrollers (MCUs) and those built for Digital Signal Processors (DSPs), for
the following reasons: (a) DSP applications are more data dominated than the control
software executed on MCUs. Memory bandwidth requirements for DSP applications range
from 2 to 3 memory accesses per processor clock cycle, while an MCU at best needs
only one memory access per cycle. (b) The DSP software for critical kernels is developed
mostly as hand-optimized assembly code, whereas the MCU software is developed in high-level
languages. Hence compiler-based optimizations may not be directly applicable to
the DSP kernels.
In this chapter we address the data layout problem for DSP memory architectures
using different methods. First, we formulate data section layout as an Integer Linear
Programming (ILP) problem. The proposed ILP formulation can handle: (i) on-chip and
off-chip memory, (ii) multiple on-chip memory banks, (iii) single and dual access RAMs,
(iv) overlay of data sections with non-overlapping life times, and (v) swapping of data
(from/to off-chip memory). The main contribution of this work is the development of
a simple unified ILP formulation. The formulation can optimize performance or cost,
although in our work we concentrate on performance. We have developed a framework
which automatically generates the ILP formulation for an embedded application. The
ILP formulation is solved using a public domain LP solver, viz., lp_solve.
The ILP-based approach is very effective for many moderately complex test cases and
delivers optimal results. However, as the application complexity increases, the execution
time of the ILP method becomes an issue: for some of the test cases, the run-time is more
than 24 hours, and in some cases the ILP does not yield a valid solution even after
running for 30 hours. Hence, we also formulate the data layout problem as a Genetic
Algorithm (GA). Finally, since the data layout problem is the kernel of the memory
architecture exploration problem and needs to be invoked several times, we looked
at developing faster methods to solve it. In this chapter we also propose a
heuristic algorithm that maps the data sections to the given memory architecture and
reduces the number of memory access conflicts (both self conflicts and parallel conflicts).
We compare the results of the heuristic, GA and ILP.
The rest of this chapter is organized as follows. The following section deals with the
necessary background and the problem statement. The ILP formulation is presented in
Section 3.3. In Section 3.4 we present the Genetic Algorithm Formulation of the data
layout problem. The Greedy back-tracking heuristic is discussed in Section 3.5. We
report the experimental results in Section 3.6. In Section 3.7 we discuss the related work.
Finally concluding remarks are presented in Section 3.8.
3.2 Method Overview and Problem Statement
3.2.1 Method Overview
Figure 3.1 explains the data layout method in a block diagram. Initially, the application's
data is grouped into logical sections. This is done to reduce the number of individual
items and thereby reduce the complexity. This step is important because, once the data is
grouped into a section, the section can only be assigned a single location, and all the data
variables inside the section are placed contiguously starting from the given memory
address. The order of data placement within a section can be arbitrary and generally does not
affect the performance. Note that a section cannot contain both code and data. There is
a trade-off in combining different variables into a section. If too many data variables are
combined into one section, then the flexibility of placement in memory is reduced.
On the other hand, if each data variable is mapped to its own section, then
there are too many sections to handle, which increases the data layout complexity. In
practice, an embedded development engineer makes a judicious choice when mapping
a set of data variables into a section. Typically, each large data array is mapped
into an individual section, and all scalar data variables belonging to a module are mapped
into one section. Note that this process is performed manually.
Once the grouping of data into sections is done, the code is compiled and executed
Figure 3.1: Overview of Data Layout
for (i=0; i<n; i++) y[i] = b[i] + a[i] * a[i-1];
Figure 3.2: Illustration of Parallel and Self Conflicts
in a cycle-accurate software simulator. From the software simulator, profile data (access
frequencies) for the data sections is obtained. In addition, the simulator generates a conflict
matrix that represents the parallel and self conflicts. Parallel conflicts refer to simultaneous
accesses of two different data sections, while self conflicts refer to simultaneous
accesses of the same data section. Consider the code segment in Figure 3.2.
In this code segment, data sections a and b need to be accessed together and therefore
represent a parallel conflict. The accesses to a[i] and a[i-1] are a self conflict. If the
arrays a and b are placed in different memory banks, or in a memory bank with multiple ports,
then these accesses can be made concurrently without incurring additional stall cycles.
However, note that the data array a, which has a self conflict, must be placed in a memory
bank with multiple ports to avoid additional stall cycles.
The conflict relations among data sections are represented by an n×n matrix, where n
is the number of data sections. The (i, j)th element represents the conflicts, or concurrent
accesses, between data sections i and j. The diagonal elements represent self conflicts. The
conflict matrix is symmetric.
As an example, consider an application with 4 data sections: a, b, c and d. A conflict
matrix is shown below, where the indices i and j are ordered as a, b, c and d. Section a
conflicts with itself and sections b and d. In this matrix, more specifically, a conflicts with
itself 100 times, while it conflicts with b and d 40 and 2000 times respectively. The sums
of all the conflicts for data sections a, b, c, and d are 2140, 540, 650 and 2050 respectively.
Hence the sorted order of the data sections in terms of total conflicts is a, d, c, b.
C =
    |  100    40     0  2000 |
    |   40   500     0     0 |
    |    0     0   600    50 |
    | 2000     0    50     0 |
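The per-section totals and the resulting placement order can be computed mechanically from the conflict matrix. The following sketch reproduces the example above (section names a, b, c, d follow the text):

```python
# Sketch: computing per-section total conflicts from the example conflict
# matrix and ordering sections by total conflicts. The matrix is symmetric
# and the diagonal holds self conflicts.
C = [[ 100,   40,   0, 2000],
     [  40,  500,   0,    0],
     [   0,    0, 600,   50],
     [2000,    0,  50,    0]]
names = ['a', 'b', 'c', 'd']

totals = {names[i]: sum(C[i]) for i in range(len(names))}
print(totals)  # {'a': 2140, 'b': 540, 'c': 650, 'd': 2050}

order = sorted(names, key=lambda s: totals[s], reverse=True)
print(order)   # ['a', 'd', 'c', 'b']
```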
Data section sizes, the access frequencies of the data sections, the conflict matrix and the memory
architecture are given as inputs to data layout. The objective of data layout is to use
the memory architecture efficiently by placing the most critical data sections in on-chip RAM
and to reduce bank conflicts by placing conflicting data in different memory banks. Data
layout assigns memory addresses for all the data sections.
3.2.2 Problem Statement
Consider a memory architecture M with m on-chip SARAM memory banks, n on-chip
DARAM memory banks, and an off-chip memory. The size of each of the on-chip memory
bank and the off-chip memory is fixed. The access time for the on-chip memory banks
is one cycle, while that for the off-chip memory is l cycles. Given an application with d
sections, the simultaneous access requirement of multiple arrays is captured by means of a
two-dimensional matrix C where Cij represents the number of times data sections i and j
are accessed together in the same cycle in the execution of the entire program. We do not
consider more than two simultaneous accesses, as the embedded core typically supports
up to two accesses in a single cycle. If data sections i and j are placed in two different
memory banks, then these conflicting accesses can be satisfied simultaneously without
incurring stall cycles. Cii represents the number of times two accesses to data section i
are made in the same cycle. Self-conflicting data sections need to be placed in DARAM
memory banks, if available, to avoid stalls. The objective of the data layout problem is
to place the data sections in memory modules such that the following are minimized:
• Number of memory stalls incurred due to conflicting accesses of data sections placed
in the same memory bank
• Self-conflicting accesses placed in SARAM banks
• Number of off-chip memory accesses.
Note that the sum of the sizes of the data sections placed in a memory bank cannot
exceed the size of the memory bank.
3.3 ILP Formulation
In this section we present our data layout formulation in a step-by-step manner. We
start with the simplest problem and include the different optimizations one by one. The
following are the different steps:
1. Basic formulation – placing the data and code in memory considering only the
on-chip and off-chip memory
2. Modeling multiple on-chip memory banks
3. Handling Single and Dual Access RAMs
4. Overlay of data sections with non-overlapping life times
5. Swapping of code and data (from/to external memory)
The ILP formulation for the optimal data section layout problem, requires a number
of application related parameters. We describe these first. An embedded application is
composed of M modules. Let NDj represent the number of data sections in module j.
Table 3.1: List of Symbols Used

Application Parameters (Constants)
  M      Number of application modules
  SDjs   Size of data section s in module j
  ADjs   Access count of data section s in module j
  NDj    Number of data sections in module j
  Bjst   Number of simultaneous accesses to data sections s and t in module j
  SBkj   Memory size required to account for scratch buffers in module j placed in bank k
  SWkj   Memory size required to account for swapped code or data in module j placed in bank k
  Sjs    1 if data section s of module j is a scratch buffer, 0 otherwise

Application Parameters (Variables)
  IDjs   0-1 variable to indicate if data section s in module j is placed in on-chip memory
  IDkjs  0-1 variable to indicate if data section s in module j is placed in the kth internal bank
  EDjs   0-1 variable to indicate if data section s of module j is placed in the external memory
  Zkjst  0-1 variable to indicate if data sections s and t of module j are both placed in the
         kth internal memory bank

Architecture Parameters (Constants)
  SMi    Size of on-chip memory
  SMe    Size of off-chip memory
  We     Number of cycles for an off-chip memory access
  Nb     Number of internal memory banks
  SMik   Size of the kth internal memory bank
  DPk    1 if the kth internal bank is DARAM, 0 otherwise
As mentioned earlier, in our discussion, we follow the convention that each data section
refers to a single array. Let the size of data section s in module j be denoted by SDjs.
The access count for a data section is denoted by ADjs. During memory layout, a data
section occupies a block of contiguous memory locations. The value for some of the above
parameters, e.g., the access counts, can be obtained by profiling the application. In our
framework the profile data is collected using an Instruction Level Simulator.
For the ILP formulation we also require memory architecture parameters. The size of
internal (on-chip) and external (off-chip) memory are denoted by SMi and SMe respec-
tively. We also need the number of stall cycles We for each access to external memory.
Table 3.1 summarizes the list of symbols used in our formulation. With this, we are ready
to describe the basic formulation.
3.3.1 Basic Formulation
As mentioned earlier we formulate the optimal performance problem in terms of the
number of memory stall cycles. A data section s in module j placed in external memory
incurs ADjs ∗ We stall cycles. To indicate whether a data section is placed in internal
or external memory, we use a 0-1 integer variable IDjs; IDjs is 1 if the data section is
placed in on-chip memory and 0 otherwise. Thus the number of stall cycles due to data
section s is ADjs · We · (1 − IDjs). The objective of the formulation is to minimize the
total number of memory stalls. That is,
    min  Σ_{j=1}^{M} Σ_{s=1}^{NDj} ADjs · We · (1 − IDjs)        (3.1)
Next we specify the memory constraints. Equations (3.2) and (3.3) enforce the con-
straint that the total size of the code and data sections that are placed in the external and
internal memory do not exceed the available external and internal memory respectively.
    Σ_{j=1}^{M} Σ_{s=1}^{NDj} SDjs · (1 − IDjs) ≤ SMe        (3.2)

    Σ_{j=1}^{M} Σ_{s=1}^{NDj} SDjs · IDjs ≤ SMi        (3.3)
Lastly, we add the constraint that IDjs are 0-1 integer variables.
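For a small instance, the basic formulation can be checked by exhaustive enumeration of the 0-1 variables rather than by an LP solver. The sketch below uses hypothetical sizes, access counts, and capacities; a real flow would hand the same model to lp_solve.

```python
from itertools import product

# Sketch: the basic formulation (3.1)-(3.3) solved by exhaustive search
# over the 0-1 variables IDjs (1 = on-chip) for a tiny hypothetical
# instance. A section left off-chip costs AD * We stall cycles; the
# capacity constraints of both memories must hold.
SD = [400, 300, 500, 200]   # section sizes (words), hypothetical
AD = [900, 100, 700, 300]   # access counts, hypothetical
We = 4                      # off-chip wait states
SMi, SMe = 800, 10000       # on-chip / off-chip capacities

best = None
for ID in product([0, 1], repeat=len(SD)):
    if sum(s * i for s, i in zip(SD, ID)) > SMi:
        continue  # violates on-chip capacity (3.3)
    if sum(s * (1 - i) for s, i in zip(SD, ID)) > SMe:
        continue  # violates off-chip capacity (3.2)
    stalls = sum(a * We * (1 - i) for a, i in zip(AD, ID))  # objective (3.1)
    if best is None or stalls < best[0]:
        best = (stalls, ID)

print(best)  # (3200, (1, 0, 0, 1)): sections 0 and 3 go on-chip
```

Note that the optimum is not simply "largest access counts first": section 2 has the second-highest access count but does not fit alongside section 0, so the solver picks sections 0 and 3.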
3.3.2 Handling Multiple Memory Banks
Embedded DSP applications are data intensive; typically two to three data sections are
accessed simultaneously in a cycle. DSP processors are designed to handle multiple data
accesses. DSPs have internal memory with multiple banks and multiple internal data
buses. The data variables that are accessed simultaneously need to be placed in different
memory banks to avoid memory stalls.
This section handles the partitioning of concurrently accessed data arrays across multiple
memory banks to avoid additional stalls. Only two simultaneous data accesses are
considered, but this can be extended easily to more than two accesses. To represent the
number of simultaneous accesses to two different data sections, in a module j, we use a
2-dimensional matrix Bj which is of size NDj × NDj. An element of this matrix, Bjst, for
s ≠ t, represents the number of simultaneous accesses to data sections s and t. Note that
Bjst refers to the total number of simultaneous accesses to the different elements of data
sections s and t. For example, if two data sections s1 and s2, each of size 100 elements,
are accessed simultaneously (as in s1[i]+s2[i]), then Bjs1s2 = 100.
Bjss refers to the number of simultaneous accesses to the same data section s. We will
consider this in our formulation in the next subsection. For the time being, we will assume
Bjss = 0 for all data section s and for all modules j. The elements of the Bj matrix are
fixed (constants), and can be obtained by profiling the application.
Let us assume that the internal memory consists of Nb banks, and that the size of the kth
memory bank is SMik. The total size of the internal memory is

    SMi = Σ_{k=1}^{Nb} SMik
Further, let IDkjs represent whether data section s of module j resides in the kth internal
bank. Lastly, we use a (derived) 0-1 variable Zkjst to represent whether data sections s
and t of module j are both placed in internal bank k. Zkjst is 1 if and only if IDkjs = 1
and IDkjt = 1. This can be expressed by the linear inequality

    Zkjst ≥ IDkjs + IDkjt − 1        (3.4)
Note that Zkjss = 1 for all sections. We replace Equations (3.2) and (3.3) in the basic
formulation with:

    Σ_{j=1}^{M} Σ_{s=1}^{NDj} SDjs · EDjs ≤ SMe        (3.5)

    Σ_{j=1}^{M} Σ_{s=1}^{NDj} SDjs · IDkjs ≤ SMik        (3.6)
where EDjs is 1 if the sth data section of module j resides in off-chip memory. These
variables can be expressed in terms of IDkjs as:

    EDjs = 1 − Σ_{k=1}^{Nb} IDkjs        (3.7)
Note that inequality (3.6) must hold for all k from 1 to Nb, the number of internal memory
banks. To enforce that each data section resides in at most one internal memory bank, we
add the constraint:

    Σ_{k=1}^{Nb} IDkjs ≤ 1        (3.8)
Inequalities (3.8) must hold for all relevant values of j and s.
Lastly, the objective function in this formulation also accounts for the stalls incurred
by not placing sections s and t in different memory banks. The second term in the
following objective function accounts for this, while the first term accounts for the stall
cycles due to external memory accesses of data sections. Thus the formulation is:

    min ( Σ_{j=1}^{M} Σ_{s=1}^{NDj} ADjs · We · EDjs
          + Σ_{j=1}^{M} Σ_{k=1}^{Nb} Σ_{s=1}^{NDj} Σ_{t=s+1}^{NDj} Bjst · Zkjst )        (3.9)
subject to constraints (3.4) to (3.8). Note that the first term includes all accesses (simul-
taneous and non-simultaneous accesses) to data section s, while the second term excludes
non-simultaneous accesses.
3.3.3 Handling SARAM and DARAM
In this formulation we account for the cost of simultaneous accesses to the same data sec-
tion. Let Bjss denote the number of such accesses. These accesses will incur an additional
stall cycle if the data section s does not reside in a memory bank that supports dual
access. Likewise, a simultaneous access to data sections s and t will incur an additional
stall cycle when they both are in memory bank k which is single ported. Let DPk = 1
denote that memory bank k is dual ported and DPk = 0 otherwise. Note that for a given
memory architecture DPk is a constant (0 or 1) and known a priori.
    min ( Σ_{j=1}^{M} Σ_{s=1}^{NDj} ADjs · We · EDjs
          + Σ_{j=1}^{M} Σ_{k=1}^{Nb} Σ_{s=1}^{NDj} Σ_{t=s}^{NDj} Bjst · (1 − DPk) · Zkjst )        (3.10)
subject to constraints (3.4) to (3.8).
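The objective (3.10) can be evaluated directly for any candidate placement, which is useful for sanity-checking a solver's output. The sketch below flattens modules away for brevity; the access counts, conflict matrix, and bank configuration are hypothetical.

```python
# Sketch: evaluating objective (3.10) for a given placement. place[s] is
# the bank of section s (0 = off-chip, 1..Nb = internal banks); DP[k] is
# 1 for DARAM banks. B[s][t] counts simultaneous accesses, with the
# diagonal holding self conflicts. All numbers are hypothetical.
def layout_cost(place, AD, We, B, DP):
    n = len(place)
    # First term: every access to an off-chip section costs We stall cycles.
    cost = sum(AD[s] * We for s in range(n) if place[s] == 0)
    # Second term: conflicting accesses mapped to the same single-ported
    # internal bank (t == s covers self conflicts in a SARAM bank).
    for s in range(n):
        for t in range(s, n):
            k = place[s]
            if k != 0 and k == place[t] and DP[k] == 0:
                cost += B[s][t]
    return cost

AD = [100, 80, 60]                      # hypothetical access counts
We = 4                                  # off-chip wait states
B = [[10, 30, 0], [30, 0, 5], [0, 5, 0]]
DP = {1: 1, 2: 0}                       # bank 1 DARAM, bank 2 SARAM

print(layout_cost((1, 2, 2), AD, We, B, DP))  # 5: sections 1,2 conflict in bank 2
print(layout_cost((0, 1, 1), AD, We, B, DP))  # 400: section 0 off-chip
```

In the first placement, section 0's self conflicts (B[0][0] = 10) cost nothing because bank 1 is dual-ported, leaving only the 5 parallel conflicts between sections 1 and 2 in the single-ported bank 2.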
3.3.4 Overlay of Data Sections
As mentioned earlier data sections that have non-overlapping life-times can share the same
on-chip memory space. These arrays are commonly referred to as scratch buffers. In our
discussion, we assume that scratch buffers are identified by the application developer. Let
Sjs = 1 denote that data section s is a scratch buffer; Sjs = 0 otherwise. The memory
used by a scratch buffer can be reused across different modules, but not within the same
module.
We account for the internal memory required for the scratch buffers in the following
way. For each module j, we compute SBkj, the sum of the sizes of the scratch buffers
in module j that are stored in the kth internal memory bank. The memory required for
scratch buffers in the kth internal bank corresponds to the maximum of SBkj over all
modules. That is:
    SBk = max_j Σ_{s=1}^{NDj} SDjs · IDkjs · Sjs        (3.11)
Further, the individual memory requirement of each scratch buffer stored in the
kth internal memory bank can be excluded from the internal memory constraint (Inequality
(3.6)). Thus Inequality (3.6) is replaced by

    Σ_{j=1}^{M} Σ_{s=1}^{NDj} SDjs · (1 − Sjs) · IDkjs + SBk ≤ SMik        (3.12)
The constraint for external memory remains the same (Inequality (3.5)). Thus the ILP
formulation in this case has the same objective function (Equation (3.10)), subject to
constraints (3.4), (3.5)–(3.8), (3.11), and (3.12).
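The scratch-buffer accounting of Equations (3.11) and (3.12) can be sketched as a per-bank usage computation: non-scratch sections are charged individually, while scratch sections are charged only through the per-module maximum SBk, since scratch memory is reused across modules. The section records below are a hypothetical representation.

```python
# Sketch: scratch-buffer accounting for bank k per Equations (3.11)-(3.12).
# Non-scratch sections are charged individually; scratch buffers are
# charged via SBk, the maximum over modules of the scratch size that a
# module places in bank k (scratch memory is reused across modules).
def bank_usage(sections, k):
    """sections: dicts with 'size', 'bank', 'scratch', 'module' (hypothetical)."""
    per_module = {}
    fixed = 0
    for sec in sections:
        if sec['bank'] != k:
            continue
        if sec['scratch']:
            m = sec['module']
            per_module[m] = per_module.get(m, 0) + sec['size']
        else:
            fixed += sec['size']
    SBk = max(per_module.values(), default=0)  # Equation (3.11) for bank k
    return fixed + SBk                         # left-hand side of (3.12)

sections = [
    {'size': 200, 'bank': 1, 'scratch': False, 'module': 1},
    {'size': 100, 'bank': 1, 'scratch': True,  'module': 1},
    {'size': 50,  'bank': 1, 'scratch': True,  'module': 1},
    {'size': 120, 'bank': 1, 'scratch': True,  'module': 2},
]
print(bank_usage(sections, 1))  # 350 = 200 fixed + max(150, 120) scratch
```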
3.3.5 Swapping of Data
Swapping of a data section is generally applied in embedded DSP systems that have a
large external memory and a very small internal memory. Here, the data that is identified
for swapping resides in external memory and is copied into the internal memory (on-chip
RAM) only for the duration of execution/access of a section. A data section is identified
for swapping by carefully weighing the swapping cost against the performance benefit that
results from accessing the section from internal memory. To model swapping, we assume
that one common swap memory space SWk is allocated in the kth internal memory bank. The
size of SWk is the maximum of the total size of all swapped sections in a module, where
the maximum is taken across all modules. The formulation for swapping proceeds in a
manner similar to that for scratch buffers, where swapped sections share the same memory area in
the on-chip memory bank. Additionally, we have to account for the off-chip requirement of
all swapped sections (Σ_{k=1}^{Nb} SWk). Lastly, the objective function should account for the
cost of swapping.
3.4 Genetic Algorithm Formulation
Genetic Algorithms (GAs) have been used to solve hard optimization problems [30]. Genetic
algorithms simulate the natural process of evolution using genetic operators such
as natural selection, survival of the fittest, mutation and crossover in order to search the
solution space.
To map an optimization problem to the GA framework, we need the following: chro-
mosomal representation, fitness computation, selection function, genetic operators, the
creation of the initial population and the termination criteria.
For the memory layout problem, each chromosome should represent a memory
placement. A chromosome is a vector of d elements, where d is the number of data
sections. Each element of a chromosome can take a value in (0 .. m), where 1..m represent
on-chip memory banks (including both SARAM and DARAM memory banks) and 0
represents off-chip memory. Thus if element i of a chromosome has the value k, then the
ith data section is placed in memory bank k, and a chromosome represents a memory
placement for all data sections. Note that a chromosome may not always represent a
valid memory placement, as the size of the data sections placed in a memory bank k may
exceed the size of k. Thus the genetic algorithm should consider only valid chromosomes
for evolution. This is achieved by giving a low fitness value for invalid chromosomes. Our
initial experiments demonstrated that the above chromosome representation (vector of
decimal numbers) is more effective than the conventional bit vector representation [30]
as the latter will lead to assignment of non-existent memory banks when the number of
memory banks is not a power of 2.
Genetic operators provide the basic search mechanism by creating new solutions based
on the solutions that exist. The selection of the individuals to produce successive generations
plays an extremely important role. The selection approach assigns a probability of
selection to each individual, depending on its fitness. An individual with a higher fitness
has a higher probability of contributing one or more offspring to the next generation. In
the selection process a given individual can be chosen more than once. Let us denote the
size of the population (number of individuals) as P . Reproduction is the operation of pro-
ducing offspring for the next generation. This is an iterative process. In every generation,
from the P individuals of the current generation, M more offspring are generated. This
results in a total population of P + M . From this total population of P + M , P fittest
individuals survive to the next generation. The remaining M individuals are annihilated.
In our data layout problem, for each individual, the fitness function computes the
number of resulting memory conflicts. Since GAs typically solve a maximization problem,
we convert our problem into a maximization problem by negation and normalization. Recall
that a chromosome may represent an invalid solution. To discourage invalid individuals,
we associate a very low fitness value to them.
The Crossover operator takes two individuals and produces two new individuals by
merging the characteristics of the two parents at a random point (named crossover site).
Mutation is applied after crossover to each individual with a given probability. Mutation
changes an individual to produce a new one by changing some of its genes. Lastly, the
GA must be provided with an initial population that is created randomly. GAs move
from generation to generation until a pre-determined number of generations has been produced or
the change in the best fitness value falls below a certain threshold. In our implementation we
have used a fixed number of generations as the termination criterion.
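The pieces described above — placement chromosomes, penalty-based fitness for invalid individuals, selection, single-point crossover, mutation, and elitist survival of the fittest — can be sketched as follows. All sizes, capacities, access counts, and the simplified cost model (every internal bank treated as single-ported, minimizing stalls directly rather than maximizing a negated fitness) are hypothetical.

```python
import random

# Sketch of the GA formulation. A chromosome is a vector of bank ids
# (0 = off-chip, 1..m = on-chip banks). Invalid placements (bank capacity
# exceeded) get an infinite cost, i.e. the lowest possible fitness.
def stalls(chrom, sizes, cap, AD, We, B):
    for k in cap:
        if sum(sz for sz, b in zip(sizes, chrom) if b == k) > cap[k]:
            return float('inf')  # invalid chromosome
    # Off-chip stalls plus conflicts between sections sharing a bank.
    cost = sum(AD[s] * We for s, b in enumerate(chrom) if b == 0)
    cost += sum(B[s][t] for s in range(len(chrom))
                for t in range(s, len(chrom))
                if chrom[s] != 0 and chrom[s] == chrom[t])
    return cost

def ga(sizes, cap, AD, We, B, pop=20, gens=40, seed=1):
    rng = random.Random(seed)
    m = max(cap)
    # Seed the population with the trivial all-off-chip placement so at
    # least one valid individual always exists.
    P = [[0] * len(sizes)] + \
        [[rng.randint(0, m) for _ in sizes] for _ in range(pop - 1)]
    for _ in range(gens):
        kids = []
        for _ in range(pop):
            a, b = rng.sample(P, 2)             # selection
            cut = rng.randrange(1, len(sizes))  # single-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:              # mutation
                child[rng.randrange(len(sizes))] = rng.randint(0, m)
            kids.append(child)
        # Survival of the fittest: keep the best pop of the P + M individuals.
        P = sorted(P + kids, key=lambda c: stalls(c, sizes, cap, AD, We, B))[:pop]
    return P[0]

sizes = [400, 300, 500, 200]   # section sizes (words), hypothetical
cap = {1: 800, 2: 600}         # bank id -> capacity
AD = [900, 100, 700, 300]      # access counts
We = 4                         # off-chip wait states
B = [[0, 50, 0, 0], [50, 0, 0, 0],
     [0, 0, 0, 0], [0, 0, 0, 0]]  # conflict matrix

best = ga(sizes, cap, AD, We, B)
print(best, stalls(best, sizes, cap, AD, We, B))
```

Because survival is elitist, the best placement found never gets worse across generations, and it is always at least as good as the all-off-chip seed.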
We have also developed a simulated annealing (SA) approach for the data layout
problem and experimented with it on some of the applications. The performance of
SA is comparable to that of GA; however, SA takes more time to arrive at a solution.
Hence we did not consider the SA approach for data layout further in this thesis.
3.5 Heuristic Algorithm
As mentioned earlier, the data layout problem is NP-complete. Further, the ILP and
GA methods described in the previous sections consume significant run-time to arrive
at a solution, and these methods are suitable only for obtaining an optimal data layout
for a fixed memory architecture. But to perform memory architecture exploration, which
is addressed in the following chapters, data layout needs to be performed for
thousands of memory architectures, so a fast heuristic method for data layout is critical.
Using an exact method such as Integer Linear Programming (ILP), or an evolutionary
approach such as GA or SA, which can take as much as 20 to 25 minutes
of computation time for each data layout problem, may be prohibitively expensive for the
memory architecture exploration problem. Hence in this section we propose a 3-step
heuristic method for data placement.
3.5.1 Data Partitioning into Internal and External Memory
The first step in data layout is to identify and place all the frequently accessed data
sections in the internal memory. Data sections are sorted in descending order of frequency
per byte (FPBi), defined as the ratio of the number of accesses to the size of the data section.
Based on this sorted order, data sections are greedily identified for placement in internal
memory while free space is available. We refer to all the on-chip memory banks together
as internal memory. Note that the data sections are not placed at this point but only
identified for internal memory placement. The actual placement decisions are made later,
as explained below.
Once all the data sections to be placed in internal memory are identified, the remaining
sections are placed in external memory. The cost of placing data section i in external
memory is computed by multiplying the access frequency of data i with the wait-states
of external memory. The placement cost is computed for all the data sections placed in
the external memory.
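This first step can be sketched as a greedy partition: sort by frequency per byte and mark sections for internal memory while they fit. The sizes, access counts, and capacity below are hypothetical.

```python
# Sketch of step 1: greedy partition by frequency per byte (FPB = accesses
# divided by size), marking sections for internal memory until the total
# on-chip capacity is exhausted. Numbers are hypothetical.
def partition(sizes, accesses, internal_capacity):
    order = sorted(range(len(sizes)),
                   key=lambda i: accesses[i] / sizes[i], reverse=True)
    internal, free = [], internal_capacity
    for i in order:
        if sizes[i] <= free:       # greedily take the densest sections
            internal.append(i)
            free -= sizes[i]
    external = [i for i in range(len(sizes)) if i not in internal]
    return internal, external

print(partition([400, 300, 500, 200], [900, 100, 700, 300], 800))
# ([0, 3], [1, 2])
```

Sections 0 and 3 have the highest access density (2.25 and 1.5 accesses per word), so they are marked for internal memory; section 2, despite its high access count, no longer fits.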
3.5.2 DARAM and SARAM placements
The objective of the next two steps is to resolve as many conflicts (self-conflicts and
parallel conflicts) as possible by utilizing the DARAM memory and the multiple banks of SARAM.
Self-conflicts can only be avoided if the corresponding data section is placed in DARAM.
On the other hand, parallel conflicts can be avoided in two ways: either by placing the
conflicting data a and b in two different SARAM banks, or by placing the conflicting data
a and b in any DARAM bank. The former solution is attractive, as the SARAM area
cost is much lower than the DARAM area cost. Considering that self-conflicts can only
be avoided by placement in DARAM, and that the cost of DARAM is very high, data placement
decisions for DARAM need to be made very carefully. Also, many DSP applications
have large self-conflicting data, so DARAM placement is crucial for reducing
the run-time of an application.
The heuristic algorithm considers placement of data in DARAM as the first step in
internal memory placement. Data sections that are identified for placement in internal
memory are sorted by self-conflict per byte (SPBi), defined as the ratio of self
conflicts to the size of the data section. Data sections, in decreasing order of SPBi, are
placed in DARAM until all DARAM banks are exhausted. The cost of placing data
section i in DARAM is computed and added to the overall placement cost.
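The DARAM step can be sketched as a first-fit fill in SPB order. The candidate list, sizes, self-conflict counts, and bank capacities below are hypothetical, and the simple first-fit bank choice stands in for the cost-based choice described in the text.

```python
# Sketch of the DARAM step: sort internal-memory candidates by self
# conflicts per byte (SPB) and fill DARAM banks greedily (first fit).
# self_conf[i] is the diagonal entry C[i][i] of the conflict matrix;
# banks is a list of remaining DARAM bank capacities. Hypothetical values.
def daram_place(candidates, sizes, self_conf, banks):
    order = sorted(candidates,
                   key=lambda i: self_conf[i] / sizes[i], reverse=True)
    placed = {}
    for i in order:
        for k, free in enumerate(banks):
            if sizes[i] <= free:   # first DARAM bank with enough room
                banks[k] -= sizes[i]
                placed[i] = k
                break
    return placed

placed = daram_place([0, 1, 2], [100, 200, 50], [100, 500, 600], [256, 256])
print(placed)  # {2: 0, 1: 0, 0: 1}
```

Section 2 has by far the highest self-conflict density (12 conflicts per word) and is placed first; section 0 no longer fits in bank 0 and spills into bank 1.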
Once the DARAM data placements are complete, SARAM placement decisions are
made. Figure 3.3 explains the SARAM placement algorithm. Parallel conflicts between
data sections i and j can be resolved by placing conflicting data sections in different
SARAM banks. The SARAM placement starts by sorting all the data sections identified
for placement in internal memory by the total number of conflicts (TCi), i.e., the
sum of all conflicts of a data section with all other data sections, including self conflicts.
Note that all data sections, including the ones already placed in DARAM,
are considered while sorting for SARAM placement. This is because the data placement
in DARAM is only tentative and may be undone in the backtracking step if there is
a larger gain (i.e., more parallel conflicts resolved) in placing a data section i in DARAM
instead of one or more data sections that are already placed there.
During the SARAM placement step, if the data section under consideration is already
placed in DARAM, then it is ignored and the next data section in the sorted order is
considered for SARAM placement. The cost of placing data section i in
SARAM bank b is computed considering all the data sections already placed in the DARAM
and SARAM banks. The bank b that results in the minimum cost is chosen.
Next, the heuristic backtracks to find if there is any gain in placing data i in DARAM
Algorithm: SARAM Placement
1.  sort the data sections in data-in-internal-memory in descending order of TCi
2.  for each data section i in the sorted order do
3.    if data section i is already placed in DARAM
4.      continue with the next data section
5.    else compute min-cost: minimum of cost(i, b) for all SARAM banks
6.    endif
7.    find if there is potential gain in placing data i in DARAM
      by removing some of the already placed sections
8.    if there is potential gain in back-tracking
9.      identify the data-set-from-daram-to-be-removed
10.     find the alternate cost of placing the data-set-from-daram-to-be-removed in SARAM
11.     if alternate cost > min-cost(i)
12.       continue with the placement of data i in SARAM bank b
13.       update cost of placement: Mcyc = Mcyc + cost(i, b)
14.     else // there is gain in backtracking
15.       move the data-set-from-daram-to-be-removed to SARAM
16.       update cost of placement:
          Mcyc = Mcyc + cost(g, b), for all g in data-set-from-daram-to-be-removed
17.       place data i in DARAM and update cost of placement
18.     endif
19.   else no gain in backtracking, continue with the normal flow
20.     continue with the placement of data i in SARAM bank b
by removing some of the already placed data sections from DARAM. This is done by
considering the size of data section i and the minimum placement cost of data i in SARAM.
If there are one or more data sections in DARAM (call this set the daram-remove-set)
whose total size is more than the size of data section i, and the sum of self-conflicts of all
these data sections is less than the minimum placement cost of data i in SARAM, then
there is potentially a gain in placing data i into DARAM by removing
the daram-remove-set. Note that this is only a possibility and not a certain gain.
Once a daram-remove-set is identified, to ensure that there is a gain in the backtracking
step, the data sections in the daram-remove-set need to be tentatively placed in SARAM
banks and the minimum placement cost recomputed for each of them. If the sum of these
minimum placement costs is greater than the original minimum cost of placing data i in
SARAM bank b, then there is no gain in backtracking and data section i is placed in
SARAM. Otherwise there is a gain: the daram-remove-set is removed from DARAM and
placed in SARAM, data section i is placed in DARAM, and the overall placement cost is
updated. This process is repeated for all data sections identified for placement in the
internal memory.
The overall-placement cost gives the memory cycles (Mcyc) for placing application data
for a given memory architecture.
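The heuristic just described can be rendered as a small Python sketch. This is illustrative only: the data structures, the cost model (self-conflicts plus pairwise conflicts with co-resident sections), and the eviction order are simplifying assumptions, not the thesis implementation.

```python
def bank_cost(i, bank, conflict):
    # Cost of adding section i to a SARAM bank: i's self-conflicts plus
    # parallel conflicts with sections already resident in that bank.
    return conflict[i][i] + sum(conflict[i][j] for j in bank["placed"])

def best_saram(i, size, banks, conflict):
    # Cheapest SARAM bank with enough free space, or (None, inf).
    fits = [b for b in banks if b["free"] >= size[i]]
    if not fits:
        return None, float("inf")
    b = min(fits, key=lambda b: bank_cost(i, b, conflict))
    return b, bank_cost(i, b, conflict)

def place(i, bank, size):
    bank["placed"].append(i)
    bank["free"] -= size[i]

def saram_layout(sections, size, TC, conflict, daram, saram):
    """Greedy placement in descending order of total conflicts TC, with
    back-tracking into DARAM.  daram and saram banks are dicts with keys
    'free' and 'placed'.  Returns the accumulated placement cost Mcyc."""
    Mcyc, done = 0, set(daram["placed"])
    for i in sorted(sections, key=lambda s: -TC[s]):
        if i in done:           # already placed (e.g. pre-placed in DARAM)
            continue
        done.add(i)
        b, min_cost = best_saram(i, size, saram, conflict)
        # Tentative daram-remove-set: evict lowest self-conflict sections
        # from DARAM until section i would fit.
        evict, freed = [], 0
        for r in sorted(daram["placed"], key=lambda r: conflict[r][r]):
            if freed >= size[i]:
                break
            evict.append(r)
            freed += size[r]
        if (freed + daram["free"] >= size[i]
                and sum(conflict[r][r] for r in evict) < min_cost):
            # Potential gain: recompute the evicted sections' SARAM cost.
            alt = sum(best_saram(r, size, saram, conflict)[1] for r in evict)
            if alt < min_cost:  # certain gain: commit the back-track
                for r in evict:
                    daram["placed"].remove(r)
                    daram["free"] += size[r]
                    rb, rc = best_saram(r, size, saram, conflict)
                    place(r, rb, size)  # sketch: assumes the evictee fits
                    Mcyc += rc
                place(i, daram, size)
                continue
        if b is not None:       # normal flow: no gain in back-tracking
            place(i, b, size)
            Mcyc += min_cost
    return Mcyc
```

In this sketch, a section that fits nowhere on-chip is silently left for external memory; the thesis handles that case in the earlier identification of data-in-internal-memory.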
3.6 Experimental Methodology and Results
3.6.1 Experimental Methodology
In this section, we explain the methodology used in our experiments. For our experiments,
the main inputs are the access characteristics of the data sections. We need the sizes of
the data sections, access frequency of each of the data sections and the conflict matrix.
The access frequency and the conflict matrix are obtained from a software profiler. Since
the DSP applications typically have simple control flow, the profile information on the
access characteristics does not change very much from run to run.
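As an illustration of these inputs, the sketch below derives access frequencies and a conflict matrix from a hypothetical per-cycle access trace; the trace format is an assumption for this example, whereas the thesis obtains the same information from a profiler integrated with the simulator.

```python
from collections import Counter
from itertools import combinations

def profile_trace(trace, n_sections):
    """trace: list of cycles, each cycle being the list of data-section
    ids accessed in that cycle (hypothetical format).  Returns access
    frequencies and a conflict matrix: conflict[i][j] counts cycles in
    which sections i and j are accessed in parallel, and the diagonal
    counts self-conflicts (multiple accesses to one section per cycle)."""
    freq = Counter()
    conflict = [[0] * n_sections for _ in range(n_sections)]
    for cycle in trace:
        freq.update(cycle)
        distinct = set(cycle)
        for i, j in combinations(sorted(distinct), 2):
            conflict[i][j] += 1   # parallel conflict between i and j
            conflict[j][i] += 1
        for s in distinct:
            conflict[s][s] += cycle.count(s) - 1  # self-conflicts
    return freq, conflict
```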
Table 3.2: Memory Architecture for the Experiments
Bank Type   Number of Banks   Bank Size (Words)
DARAM       4                 4096
SARAM       2                 32768
We have used the Texas Instruments TMS320C55XX processor [76] for our experi-
ments. This processor has three 16-bit memory read buses and two 16-bit memory write
buses and has the capability to read three 16-bit data and write two 16-bit data in the
same clock cycle. The memory architecture of the 55X device is given in Table 3.2. Note
that the total memory size is 72 Kwords and is large enough to fit each of the instances
of all the four applications reported in Table 3.4.
We have used the Texas Instruments Code Composer Studio V2.2 [73] to run the
applications. Initially the applications are compiled with the CCS2.2 compiler with the
default memory placement made by the compiler. The compiled application is loaded
and simulated in the simulator to obtain the profile information and the conflict matrix,
which are inputs to the heuristic and the Genetic algorithms.
We have developed a framework which automatically generates the ILP formulation for
an embedded application. The input for the ILP formulation generator, specified in XML
format, are application parameters, memory configuration parameters, and the profile
data parameters. To obtain profile information, we developed a profiler and integrated it
with the C5510 Instruction Set Simulator (ISS) [73].
The ILP formulation is solved using lp_solve, a public domain LP solver [3].
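The formulation generator can be pictured with a toy model. The sketch below emits a deliberately simplified data-layout ILP (placement variables x_i_b, one-bank-per-section and bank-capacity constraints only; the thesis formulation additionally models conflicts and bank types) in lp_solve's LP file format.

```python
def make_lp(sizes, cost, cap):
    """Emit a toy data-layout ILP in lp_solve LP format.  x_i_b = 1 iff
    data section i is placed in bank b; cost[i][b] is a hypothetical
    per-placement cost standing in for the conflict-based objective."""
    n, m = len(sizes), len(cap)
    var = lambda i, b: f"x_{i}_{b}"
    lines = ["min: " + " + ".join(f"{cost[i][b]} {var(i, b)}"
                                  for i in range(n) for b in range(m)) + ";"]
    for i in range(n):   # each section placed in exactly one bank
        lines.append(" + ".join(var(i, b) for b in range(m)) + " = 1;")
    for b in range(m):   # bank capacity constraint
        lines.append(" + ".join(f"{sizes[i]} {var(i, b)}" for i in range(n))
                     + f" <= {cap[b]};")
    lines.append("bin " + ", ".join(var(i, b)
                                    for i in range(n) for b in range(m)) + ";")
    return "\n".join(lines)
```

The resulting text file can be fed directly to lp_solve; the real generator additionally reads the XML parameter files described above.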
3.6.2 Integer Linear Programming - Results
To compare the performances of the different layouts, we consider the number of memory
stall cycles and sometimes MIPS consumed, which is commonly used in embedded systems
design. MIPS consumed refers to the processing capability required to guarantee real-
time performance for a given application. Thus, the higher the MIPS consumed, the lower
the performance of the layout. Conversely, the lower the MIPS consumed, the more
applications, or the higher the number of instances of the same application, that can be
run on the embedded device,
guaranteeing real-time performance for all of these instances. MIPS consumed for a given
data layout is obtained by running the application on the simulator with the given layout
and for the given memory architecture. We report these numbers when a single instance
of the application is run on the embedded system.
First, we report the MIPS consumed by the optimal solution obtained using the fol-
lowing formulations for the four applications.
• The basic formulation is used to get a data layout considering only the internal and
external memory. The internal memory is considered as one single SARAM bank
of size 12K words. We use this as the baseline model. In Table 3.3, we report the
normalized MIPS consumed, the number of variables and the number of constraints
for each ILP formulation, and the time taken on a 900 MHz Pentium III machine1
to solve the ILP problem.
• The on-chip memory is split into multiple SARAM banks (three 4K word SARAM
banks) and the formulation that handles multiple memory banks was used. The
results for this case (refer to Table 3.3) show 14%, 16%, 3.8%, and 7.1% performance
improvement over the baseline case for the four applications.
• Next, the basic formulation is extended to handle SARAM and DARAM banks. For
this formulation we assumed that the internal memory consists of 2 banks, one, of 8K
word size, supporting single access (SARAM) and another, of 4K word, as DARAM.
This optimization gives a 16%, 18%, 4.8% and 7.4% performance improvement over
the baseline case for the four applications.
• The last experiment is performed by enhancing the basic formulation to handle both
multiple banks, and bank types (SARAM and DARAM). The memory configuration
considered here consists of two 4K word SARAM banks and one 4K word DARAM
1 These experiments were run in 2002 on a set of proprietary benchmarks with a desktop configuration which was state of the art at that time. We are unable to repeat the experiments on a more modern platform due to portability reasons. We have, however, run another set of benchmarks on a recent desktop configuration and the results are reported in Section 3.6.3.
bank. This optimization exploits both multiple memory banks and dual access
capabilities of the scratch pad memory and gives a significant reduction (28%, 30%,
9% and 13.8%) in MIPS consumption over the baseline case.
We remark that the somewhat lower performance improvement in Appln. 3 and Appln. 4
could be due to the fact that these are kernel codes where multiple simultaneous memory
Figure 3.4 shows the performance of the genetic algorithm when the number of gen-
erations is increased from 700 to 4000. The bar charts correspond to (a) the run-time of
the GA in logarithmic scale, and (b) the difference of the number of conflicts resolved by
the heuristic and the GA. Notice that when the number of generations is less than 3000,
the GA is in the catch-up mode. When the number of generations reaches 4000, the GA
outperforms the heuristic algorithm.
3.6.4 Comparison of Heuristic Data Layout with GA
We compared our heuristic data layout performance with GA’s data layout. Figure 3.5
presents the normalized performance of heuristic data layout. We randomly picked 100
different architectures for obtaining data layout for the Voice Encoder application. For
these 100 points, we ran both GA and our heuristic algorithm. The x-axis represents test
Figure 3.4: Relative performance of the Genetic Algorithm w.r.t. Heuristic, for Varying Number of Generations
[Figure: normalized performance (y-axis, 0 to 1.2) versus test number (x-axis, 0 to 100).]
Figure 3.5: Comparison of Heuristic Data Layout Performance with GA Data layout
case identifier from 1 to 100. The y-axis presents the memory stall cycles of heuristic
normalized by GA’s memory cycles (M gacyc/M
heucyc ). It can be observed that the heuristic
method performs as well as the GA for most of the points. The worst performance of the
heuristic is approximately 25% below the GA's performance, for two of the test cases. On
average, however, the heuristic performs at 98% efficiency in terms of solution quality
compared to the GA. The execution time of the GA for completing all 100 data layouts is
approximately 22 hours, whereas the execution time of the heuristic for completing all 100
placements is less than a second on a Pentium 4 desktop machine with 1 GB of main memory
operating at 1.7 GHz. Thus the heuristic method is an attractive option, providing
efficient solutions in very little execution time for the large number of data layout
problems that arise in memory architecture exploration. Note that for some of the points
the heuristic performs better than the GA; this may be due to the GA terminating after a
fixed number of generations. Based on the above results, we can conclude that the
heuristic algorithm is both fast and efficient.
3.6.5 Comparison of Different Approaches
In Table 3.5, we provide a qualitative comparison of the heuristic algorithm, the genetic
algorithm, and the ILP-based approach for the data layout problem. We see that the run-time
of the heuristic is the lowest among the three approaches. The run-time of the Genetic
algorithm depends on the number of generations, the population size P , and the number
of offspring per generation M . For the four large test cases of Table 3.4, the run-time
of the heuristic algorithm was of the order of 1 second, whereas the GA took about 20
minutes to complete. The ILP approach, on the other hand, required several hours to
converge to the optimal solution. As a matter of fact, a public domain ILP solver could
not converge in 24 hours for the 6-instance Voice Encoder and the 32-instance Levinson’s
LPC applications. This clearly demonstrates that the GA and the heuristic methods
are attractive from the point of view of quickly solving the data layout problem. From
the view point of optimality of solutions, ILP is guaranteed to converge to the optimal
and is hence ranked the best. The GA comes second since it provides better solutions for
larger problem instances. From the viewpoint of flexibility, we believe GA is most flexible,
since the cost function can be easily reflected in the fitness measure. For example, if the
power or energy dissipation must be minimized, we can modify the fitness function to
be a weighted average of the power and performance metrics and still reuse the genetic
algorithm framework. In contrast, it is difficult for the heuristic algorithm to
simultaneously optimize both performance and power.
From Table 3.5, we see that each of the three approaches for the data placement
problem has a definite advantage over the other two methods in terms of run-time, quality,
or flexibility. This point can be exploited as follows. The algorithms presented in the
Table 3.5: Comparative Ranking of Algorithms
Optimization Approach   Run-Time       Quality of Solution   Flexibility
Heuristic               Best           Worst                 Intermediate
GA                      Intermediate   Intermediate          Best
ILP                     Worst          Best                  Worst
previous section are intended to optimize the placement of data sections for a given
memory architecture. Often, the designers of the SoC have the flexibility to change the
memory architecture. It would be ideal if the memory architecture optimization and
the data section placement were to be done concurrently. This is a classic example of
hardware-software codesign. We propose a “multi-objective Genetic Algorithm” technique
for memory architecture exploration, where the number and size of the SARAM and
DARAM banks can be determined through a combinatorial search process. For each of
the memory architecture considered, a quick “fitness” can be computed based on the cost
of the best placement obtained using the heuristic algorithm. The heuristic is an ideal
choice for computation of the bound since it is the fastest approach, requiring less than 1
second of run time. After a small number of competing memory architectures have been
shortlisted through this procedure, the GA can be used to explore the best placement of
data sections for each of the memory configurations.
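This shortlist-then-refine flow can be sketched in a few lines; heuristic_cost and ga_cost are placeholder callables standing in for the two layout engines described above.

```python
def explore(architectures, heuristic_cost, ga_cost, shortlist=5):
    """Rank all candidate memory architectures by the fast heuristic's
    layout cost, then rerun only a small shortlist with the slower but
    better GA layout and return the overall winner (costs minimized)."""
    ranked = sorted(architectures, key=heuristic_cost)[:shortlist]
    return min(ranked, key=ga_cost)
```

The point of the two stages is that the expensive GA layout is invoked only `shortlist` times instead of once per candidate architecture.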
3.7 Related Work
Several efficient heuristic approaches for data layout have been published in the literature
[10, 34, 40, 44, 53, 55, 58, 67, 71, 79]. These can be classified as static and dynamic
methods. In static data layout, the memory addresses for all data variables are decided at
compile time and do not change at run-time. In dynamic data layout [79], on-chip SPRAM
is reused by overlaying many data variables to the same address. Thus, two addresses are
assigned to a variable at compile time, namely, a load address and a run address. A variable
is loaded at the load address and copied to the run address at run-time. At the cost of
increased complexity, overlaying attempts to improve the system efficiency by delivering
better run-time performance with lower SPRAM size.
Avissar et al. [10] present an Integer Linear Programming (ILP) based technique for a
compiler to automatically allocate data to on-chip or external memory based on the
access frequency. In [10], the authors handle the data partitioning problem for
globals and stack variables. They propose an approach to partition the stack into two
parts. They are the first to present a solution for partitioning stack variables into critical
and non-critical sections such that the critical stack variables can be mapped to on-
chip memory and the non-critical stack variables are mapped to off-chip memory. This
approach requires additional run-time, as two stacks have to be managed. They also
propose an alternative approach, which incurs no additional run-time, where all local
variables of a function are allocated in one of the stacks (on-chip or off-chip) with
some performance loss. Their memory architecture considers only on-chip and external
memory, without considering multiple memory banks. Multiple memory banks are part
of every DSP-based embedded application, and it is critical to take this into account
during data layout. Our observation is also that ILP-based approaches typically take a
long time (a few hours) to converge for moderate to complex applications.
Leupers et al., [44] present an interference graph based approach for partitioning vari-
ables that are simultaneously accessed in different on-chip memory banks. For a given
interference graph, the Integer Linear Programming (ILP) approach is used to solve the
problem of maximizing the weights (number of parallel accesses). This work does not
consider DARAM, which is very important for DSP applications, and ILP-based approaches
may not be practical for complex applications. Further, they consider improving only
run-time; in this thesis we look at both power and run-time, and in Chapter 5 we show
that a significant trade-off is possible between power and performance. To avoid the cycle
costs from self-conflicts (multiple simultaneous accesses to the same array), [58] suggests
partial duplication of data in different memory banks.
In [40], Ko et al., present a simple heuristic to partition data with the objective
of resolving parallel conflicts and also balance the size of the partitions. Balancing is
important as typically programmable processors have equal sized memory banks. This
work uses benchmarks written in C, and hence the conflict graphs are very sparse; only
bipartite graphs are obtained from the compiler. Because of this, they could resolve all
the conflicts, and their main focus is only on balancing the data partitions. Typical
DSP applications, however, have dense conflict graphs, as the critical part of the
software is developed in hand-optimized assembly. Their work does not address parallel
conflicts between more
than two arrays. Also they do not consider dual access RAMs. Their objective is to
reduce the data conflicts and improve run-time.
In [71], Sundaram et al., present an efficient data partitioning approach for data ar-
rays on limited-memory embedded systems. They perform compile time partitioning of
data segments based on the data access frequency. The partitioned data footprints are
placed in local or remote memory with the help of a 0/1 knapsack algorithm. Here the
data partitioning is performed at a finer granularity and because of this, the address com-
putation needs to be modified for functional correctness. In contrast, in our work, the
data partitioning is performed at the data section level and the data layout optimization
is performed by considering a more complex on-chip memory architecture with multi-
ple single and dual port memory banks. Further, no additional address computation or
modification to address computation is required in our approach.
Kulkarni et al., [41] present formal and heuristic algorithms to organize the data
in the main memory with the objective of reducing cache conflict misses. In [55], a
data partitioning technique is presented that places data into on-chip SRAM and data
cache with the objective of maximizing performance. Based on the life times and access
frequencies of array variables, the most conflicting arrays are identified and placed in
scratch pad RAM to reduce the conflict misses in the data cache. This work addresses
the problem of limiting the number of memory stalls by reducing the conflict misses
in the data cache through efficient data partitioning. Our work addresses the problem
of reducing the memory stalls by efficient data partitioning within the on-chip scratch
pad RAM itself. Also our work addresses the data layout for DSP applications, where
resolving self and parallel conflicts by efficient partitioning of data variables is very critical
for achieving real-time performance. Lastly, the memory architecture considered in the
initial part of our thesis does not have data cache; in Chapter 7, we consider memory
architecture with on-chip RAM and caches.
3.8 Conclusions
In this chapter, we described three approaches to solve the data placement problem in
embedded systems. Given a memory architecture, the placement of data sections is crucial
to the performance of the system. Badly placed data can result in a large number of
memory stalls. We consider a memory architecture that consists of on-chip single-access
RAM with multiple memory banks, on-chip dual-access RAM, and external RAM. We
analyze the application for data conflicts using a profiling tool and create a matrix rep-
resentation of the conflict information. We present three different methods to address
the data layout problem: (a) an ILP formulation, (b) a Genetic Algorithm, and (c) a
greedy back-tracking heuristic algorithm. The greedy back-tracking heuristic and Genetic
Algorithm approaches outperform the ILP-based formulation in terms of the time to solve
the data layout problem. However, the ILP and GA methods produce better-quality results,
especially for large-sized applications. The framework of the GA is generic enough to
permit other cost functions such as power dissipation [64] to be incorporated. In Chap-
ter 5 we extend the GA formulation to consider performance and power minimization.
Similarly, the GA can also be extended to concurrently explore alternative memory ar-
chitectures [54] – this is possible by changing the representation of the chromosome and
reworking the crossover and mutation operations.
Chapter 4
Logical Memory Exploration
4.1 Introduction
In the previous chapter, we discussed data layout methods to find optimal and near-
optimal placement of data for a given fixed memory architecture for embedded DSP
processors. In this chapter we focus on memory architecture exploration for a given
application, with the goals of improving memory performance (reducing memory stalls) and
reducing memory area. In Chapter 5 we extend our approach to consider power consumption
as an additional objective.
As discussed in Chapter 1, embedded systems are application specific and hence em-
bedded designers study the target application to understand the memory architecture re-
quirements. DSP applications are typically data intensive and require very high memory
bandwidth to meet real-time requirements. There are two steps to designing an optimal
memory architecture for a given application. The first step is to find the right memory ar-
chitecture parameters that are important for improving the target application's performance,
and the second step is to optimally map the given application onto the memory architec-
ture under consideration. This leads to a two-level optimization problem with multiple
objectives. At the first level, an appropriate memory architecture must be chosen which
includes determining the number and size of each memory bank, the number of memory
ports per bank, the types of memory (scratch pad RAM or cache), wait-states/latency
etc. Thus the number of possible memory architectures for an SoC running a given
application is very large. The objective functions at this level are the memory system
cost, performance,
and power dissipation. However, the performance for a given application for a given ar-
chitecture depends on the appropriate placement of code and data sections in the various
on-chip memory banks or the off-chip memory modules. Hence, at the next level, for a
given application, the code and data sections must be placed optimally in memory to
minimize the number of stall cycles. As discussed in the previous chapter, the number
of placements for a given architecture is also large. Thus the combined space of memory
architectures and data placements is formidably large, and finding an optimal solution
to the memory space exploration problem involves exploring this large design space. A
performance-optimal solution may not be optimal in terms of cost or power
consumption. In this solution space there are several interesting design points known as
Pareto-optimal points for the embedded system design. This is especially the case as an
embedded system designer typically designs multiple variants of an embedded product (to
meet different market segments) and hence wants to obtain several good solutions, each
of which may make sense for a different application segment. Hence the memory space ex-
ploration problem should identify multiple Pareto-optimal design points. Further since
embedded system products are often designed under tight time-to-market constraints, the
resources available for such an optimization process are limited. To make the problem
more complex, the market space is volatile and frequently the top-level specification and
architecture may be redefined during the life cycle of a product.
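Pareto optimality can be made concrete with a short sketch. Each design point is a tuple of objectives to be minimized (for example memory stalls, area, power), and a point is Pareto-optimal when no other point dominates it.

```python
def dominates(a, b):
    # a dominates b: no worse in every objective, strictly better in one
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    # keep only the non-dominated (Pareto-optimal) design points
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```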
In this chapter, we propose an efficient methodology for the memory architecture
exploration of the DSP core1. We concentrate mainly on the DSP core as it largely
determines the performance of the embedded application. We consider both on-chip and
off-chip memory space exploration for the DSP core. The memory architecture exploration
problem involves identifying the appropriate memory architectures for a given application,
in terms of performance, power consumption and cost. As mentioned earlier, this involves
1In addition to the DSP core, an embedded SoC will have a micro-controller which has embeddedmemory. In this thesis we do not focus on the memory system design of embedded micro controllersthough many of the methods proposed may be applicable to microcontrollers as well.
solving two interacting problems: (a) memory architecture exploration and (b) data layout
optimization for the architecture considered.
Previous work on data layout [10, 41, 44, 53, 58, 71] has focused on addressing the
layout problem independently, for a given memory architecture, with the objective of
improving either application run-time or energy consumption. Also, the previous
work in this area has addressed the data layout either for memory architecture on the
embedded side (microcontroller), where they do not consider dual-port memories, or on
the DSP side, where the on-chip/off-chip partitioning is not considered. A detailed com-
parison with related work is presented in Section 4.6. To the best of our knowledge,
there is no work which considers integrating memory architecture exploration and data
layout to explore the memory design space while targeting multiple objectives. This
integrated approach is critical for navigating the search space in the right direction
and obtaining multiple Pareto-optimal design points.
In this chapter we propose an iterative, two-level integrated approach to the data layout
and memory exploration problem. At the outer level, for architecture exploration, we use
a multi-objective evolutionary algorithm; we propose both a Genetic Algorithm (GA)
formulation and a Simulated Annealing (SA) formulation for this problem. For the inner
level, i.e., the data layout problem, we use the simple and fast heuristic algorithm
described in Section 3.5, because the data layout problem must be solved for several
thousands of memory architectures. As discussed in Chapter 3, the heuristic algorithm
proposed there performs reasonably well in reducing memory stalls and at the same time
obtains the data layout in very little computation time (less than 1 msec). In comparison,
the GA or ILP approach takes a few minutes to a few hours for each data layout problem,
which becomes prohibitively expensive when solving for a large number of memory
architectures.
The main contributions of this chapter are: (a) an iterative, two-level solution that
addresses data layout and architecture exploration as an integrated problem; (b) the use
of performance (in terms of memory stalls) and memory area as the two objectives of the
memory exploration framework; and (c) a memory exploration
framework that is fully automatic. The proposed memory exploration framework is flex-
ible and can be configured to explore additional memory architecture parameters. Also
the framework is scalable and additional objectives like power consumption can be added
easily. We have used 4 different multimedia and communication applications for our ex-
periments. Our proposed memory exploration method yields between 130 and 200
Pareto-optimal design choices (memory architectures) for each of the applications.
The rest of the chapter is organized as follows. Section 4.2 provides necessary back-
ground on the data layout and memory architecture exploration. Section 4.3 describes the
multi-objective Genetic Algorithm (GA) formulation of the memory exploration problem.
Section 4.4 explains the formulation of the memory architecture exploration problem in
Simulated Annealing.

Notation:
Sp  Single-port memory bank
Dp  Dual-port memory bank
Ns  Number of SARAM banks
Bs  SARAM bank size
Nd  Number of DARAM banks
Bd  DARAM bank size
Es  External memory size
Ds  Total data size
Ws  Normalized weight for SARAM
Wd  Normalized weight for DARAM
We  Normalized weight for external memory
with the physical memory architecture, and hence we defer the power objective to the
following chapter on physical memory exploration.
Memory cycles (Mcyc) is the sum of all memory stall cycles where the CPU is waiting
for memory. This includes stall cycles spent in on-chip memory bank conflicts and off-chip
memory latency. Our objective is to minimize the number of stall cycles (Mcyc). It is very
critical to have an efficient data layout algorithm to obtain a valid Mcyc. Note that if an
efficient data layout algorithm is not used, the data mapping may not be optimal, leading
to a higher Mcyc even for a good memory architecture. This may steer the memory
architecture exploration search in a completely wrong direction.
Memory cost is directly proportional to the silicon area occupied by memory. Since
the memory silicon area is dependent on the silicon technology, memory implementation,
and the ASIC cell library that is used, instead of considering the absolute silicon area
numbers, for now, we consider the relative (logical) area. The memory cost is defined by
operators, the creation of the initial population and the termination criteria. Figure 6.5
explains the GA formulation of the Physical Memory Mapping problem.
6.2.3 Genetic Algorithm Formulation
6.2.3.1 Chromosome Representation
Each individual chromosome represents a physical memory architecture. As shown in
Figure 6.5, a chromosome consists of a list of physical memories picked from an ASIC
memory library. This list of physical memories is used to construct a given logical
memory architecture. Typically, multiple physical memory modules are used to construct
a logical memory bank. As an example, if the logical bank is of size 8K*16bits, then the
physical memory modules can be two 4K*16bits or eight 2K*8bits or eight 1K*16bits and
so on. We have limited the number of physical memory modules per logical memory bank
6.2 Logical Memory Exploration to Physical Memory Exploration (LME2PME) 117
Figure 6.4: Logical to Physical Memory Exploration - Method
to at most k. Thus, a chromosome is a vector of d elements, where d = Nl ∗ k + 1 and
Nl is the number of logical memory banks, which is an input from LME. Note that each
element represents an index into the semiconductor vendor memory library, corresponding
to a specific physical memory module.
For decoding a chromosome, for each of the Nl logical banks, the chromosome has k
elements. As mentioned earlier, each of the k elements is an integer used to index into
the semiconductor vendor memory library. With the k physical memory modules, a logical
memory bank is formed. We have used a memory allocator that performs exhaustive
combinations with the k physical memory modules to get the largest logical memory re-
quired with the specified word size. Here, the bank size, the word size and the number
of ports are obtained from the logical memory architecture, corresponding to the chosen
non-dominated point. In this process, it may happen that m out of the total k physical
Figure 6.5: GA Formulation of LME2PME
memories selected may not be used, if the given logical memory bank can be constructed
with k−m physical memories.1 For example, if k=4, the 4 elements are 2K*8bits, 2K*8bits,
1K*8bits, and 16K*8bits, and the logical memory bank is 2K*16bits, then our memory
allocator builds a 2K*16bits logical memory bank from the two 2K*8bits modules, and the
remaining two memories are ignored. Note that the 16K*8bit and 1K*8bit memories are
removed from the configuration, as the logical memory bank can be constructed optimally
with the two 2K*8bit memory modules. Here, the memory area of this logical
1 This approach of using only the required k−m physical memory modules relaxes the constraint that the chromosome representation has to exactly match a given logical memory architecture. This, in turn, facilitates the GA approach exploring many physical memory architectures efficiently.
memory bank is the sum of the memory areas of the two 2K*8bit physical memory modules.2
This process is repeated for each of the Nl logical memory banks. The memory area of
a memory architecture is the sum of the area of all the logical memory banks.
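A minimal sketch of this decoding is given below. Modules are (depth, width) pairs, and the allocator is simplified to stitching equal-depth modules side by side to reach the target width (a real allocator would also stack modules depth-wise); with that assumption it reproduces the two-2K*8 example above.

```python
from itertools import combinations

def build_bank(modules, depth, width):
    """Smallest subset of candidate modules that tiles a depth x width
    logical bank by width-stitching equal-depth modules; None if the
    bank cannot be built.  Unused modules are simply dropped."""
    for r in range(1, len(modules) + 1):
        for combo in combinations(modules, r):
            if (all(d == depth for d, w in combo)
                    and sum(w for d, w in combo) == width):
                return list(combo)
    return None

def decode(chromosome, library, banks, k):
    """chromosome: flat list of library indices, k genes per logical
    bank; banks: list of (depth, width) targets from the LME step.
    Returns the physical modules actually used per logical bank."""
    out = []
    for n, (depth, width) in enumerate(banks):
        picks = [library[g] for g in chromosome[n * k:(n + 1) * k]]
        out.append(build_bank(picks, depth, width))
    return out
```

The area of a decoded bank is then simply the sum of the areas of the modules returned, matching the fitness computation described below.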
6.2.3.2 Chromosome Selection and Generation
The strongest individuals in a population are used to produce new off-springs. The
selection of an individual depends on its fitness; an individual with a higher fitness has
a higher probability of contributing one or more offsprings to the next generation. In
every generation, from the P individuals of the current generation, M new offsprings
are generated using mutation and crossover operators, resulting in a total population
of (P + M). The crossover operation is performed as illustrated in Figure 6.5. From
this total population of (P + M), P fittest individuals survive to the next generation.
The remaining M individuals are annihilated. The crossover and mutation operators are
implemented in the standard way.
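The (P + M) generational scheme described above can be sketched as follows. The chromosome encoding, crossover and mutation operators are placeholders, and a scalar fitness stands in for the Pareto ranking of Section 6.2.3.3, so this is a sketch of the survivor-selection mechanics only.

```python
import random

def next_generation(population, fitness, crossover, mutate, M, rng=random):
    """One GA generation: create M offspring from the current P
    individuals via crossover and mutation, then keep the P fittest
    of the combined (P + M) pool.  `fitness` returns a value where
    higher is fitter."""
    P = len(population)
    offspring = []
    for _ in range(M):
        a, b = rng.sample(population, 2)   # parent selection (uniform here)
        offspring.append(mutate(crossover(a, b)))
    pool = population + offspring          # total population of (P + M)
    pool.sort(key=fitness, reverse=True)   # fittest first
    return pool[:P]                        # survivors; the rest are annihilated

# Toy usage: chromosomes are bit tuples, fitness = number of ones.
pop = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
survivors = next_generation(pop, fitness=sum,
                            crossover=lambda a, b: a[:1] + b[1:],
                            mutate=lambda c: c, M=2,
                            rng=random.Random(0))
```

In the thesis the ranking step is the non-dominated sorting of Section 4.3 rather than a single scalar.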
6.2.3.3 Fitness Function and Ranking
For each of the individuals, the fitness function computes Marea and Mpow. Note that
Mcyc is not computed as it is already available from LME. Marea is obtained from
the memory mapping block and is the sum of the areas of all the physical memory modules
used in the chromosome. Mpow is computed based on two factors: (a) the access frequency
of data-sections and the data-placement information and (b) power per read/write access
information derived from the semiconductor vendor memory library for all the physical
memory modules.
To compute the memory power the method uses the data layout information provided
by the LME step. Based on the data layout, and the physical memories required to form
the logical memory (obtained from the chromosome representation), the accesses to each
data section are mapped to the respective physical memories. From the power per
access of each physical memory and the number of accesses to each data section, the
total memory power consumed by all accesses to a data section is determined. From this,
the total memory power consumed by the entire application on a given physical memory
architecture is computed by summing the power consumed by all the data sections.
²Although the chromosome representation may have more physical memories than required to construct the given logical memory, the fitness function (area and power estimates) is derived only for the required physical memories.
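The power computation described above amounts to a weighted sum over data sections; a minimal sketch, with all names and values being our own illustrative assumptions:

```python
def total_memory_power(placement, accesses, power_per_access):
    """Total memory power consumed by the application.
    placement[d]        -> physical memory holding data section d
    accesses[d]         -> number of accesses made to data section d
    power_per_access[m] -> energy per read/write access of physical
                           memory m (from the vendor memory library)"""
    return sum(accesses[d] * power_per_access[placement[d]]
               for d in placement)

# Two data sections mapped to two physical memories.
placement = {"coeffs": "spram0", "frame": "dram"}
accesses = {"coeffs": 1000, "frame": 200}
ppa = {"spram0": 0.5, "dram": 5.0}
print(total_memory_power(placement, accesses, ppa))  # 1500.0
```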
Once the memory area, memory power and memory cycles are computed for all the
individuals in the population, the individuals are ranked according to the Pareto optimality
conditions given in the following equation, which is similar to the Pareto optimality
condition discussed in Chapter 4 but considers all three objective functions. Let
(M^a_pow, M^a_cyc, M^a_area) and (M^b_pow, M^b_cyc, M^b_area) be the memory power,
memory cycles and memory area of chromosome A and chromosome B. A dominates B if the
following expression is true:

((M^a_pow < M^b_pow) ∧ (M^a_cyc ≤ M^b_cyc) ∧ (M^a_area ≤ M^b_area))
∨ ((M^a_cyc < M^b_cyc) ∧ (M^a_pow ≤ M^b_pow) ∧ (M^a_area ≤ M^b_area))
∨ ((M^a_area < M^b_area) ∧ (M^a_cyc ≤ M^b_cyc) ∧ (M^a_pow ≤ M^b_pow))     (6.1)
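The dominance condition amounts to "no worse in every objective and strictly better in at least one", so it and the removal of dominated points can be expressed compactly. A sketch, with tuples ordered as (Mpow, Mcyc, Marea) and all objectives minimized:

```python
def dominates(a, b):
    """a, b are (Mpow, Mcyc, Marea) tuples; all objectives minimized.
    a dominates b iff a is no worse in every objective and strictly
    better in at least one -- equivalent to the three-clause
    disjunction above."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep only the non-dominated points."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

archs = [(10, 100, 4), (12, 90, 4), (11, 110, 5)]
print(pareto_front(archs))   # [(10, 100, 4), (12, 90, 4)]
```

The third point is dominated by the first (worse in all three objectives) and is discarded; the first two trade power against cycles and both survive.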
For ranking of the chromosomes, we use the non-dominated sorting process described
in Section 4.3. The GA must be provided with an initial population that is created
randomly. In our implementation we have used a fixed number of generations as the
termination criterion.
6.3 Direct Physical Memory Exploration (DirPME)
Framework
6.3.1 Method Overview
In the LME2PME approach described in the previous section, the physical memory ex-
ploration is done in two steps. In this section we describe the DirPME framework that
directly operates in the physical memory design space. The memory exploration framework
consists of two levels. The outer level explores various memory architectures, while the
inner level explores the placement of data sections (the data layout problem) to minimize
memory stalls. More specifically, the outer level, the memory architecture exploration
phase, targets the optimization of the cache and SPRAM sizes and the organization of the
cache architecture, including cache-line size and associativity. We use
an exhaustive search¹ for memory architecture exploration by imposing certain practical
constraints (such as, the memory bank size is always a power of 2) on the architectural
parameters. Although these constraints limit the search space, they still allow all “prac-
tical” architectures to be considered and at the same time help to reduce the run-time
of the memory exploration phase drastically. The exploration module takes the application's
total data size as input and provides an instance of the memory architecture by defining
(a) cache size, (b) cache block size, (c) cache associativity and (d) SPRAM size². Based on
the SPRAM size and the application access characteristics, the data partitioning heuris-
tic identifies the data sections to be placed in SPRAM. The remaining data sections are
placed in off-chip RAM. The details of the data partitioning heuristic are presented in
Section 7.3.
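Under the power-of-2 restriction, the outer-level enumeration reduces to a few nested loops. The parameter ranges below are illustrative assumptions, not the bounds used in the thesis:

```python
def enumerate_architectures(total_data_size):
    """Yield candidate (cache_size, block_size, associativity, spram_size)
    tuples, with every size parameter restricted to a power of two.
    The ranges are illustrative assumptions."""
    archs = []
    cache_sizes = [1 << i for i in range(10, 16)]   # 1KB .. 32KB
    block_sizes = [1 << i for i in range(4, 8)]     # 16B .. 128B
    assocs      = [1, 2, 4]
    spram_sizes = [1 << i for i in range(10, 16)]   # 1KB .. 32KB
    for c in cache_sizes:
        for b in block_sizes:
            for a in assocs:
                for s in spram_sizes:
                    if s <= total_data_size:        # SPRAM larger than the data is pointless
                        archs.append((c, b, a, s))
    return archs

print(len(enumerate_architectures(96 * 1024)))   # 432
```

Even with generous ranges the constrained space stays in the hundreds of points, which is what makes exhaustive search practical here.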
The cache conscious data layout heuristic assigns addresses to the data sections placed
in off-chip RAM such that these data do not conflict in the cache. The data layout heuristic
uses the temporal access information as input to find the optimal data placement. The
objective is to minimize the number of cache misses. In Section 7.4 we discuss the
proposed cache-conscious data layout.
The data partitioning heuristic and data layout heuristic together place the application
data in SPRAM and off-chip RAM respectively. From the temporal access information
¹Alternative approaches such as a genetic algorithm or simulated annealing could also be used here. However, we found that the exhaustive approach explores all practical memory architectures in a reasonable amount of computation time.
²The proposed framework can easily be extended to consider SPRAM organization parameters such as the number of banks, the number of ports, etc. We do not consider these here as they were extensively dealt with in the earlier chapters.
of data sections and access frequency information, the run-time performance in terms of
memory stall cycles is computed. The memory stalls include stall cycles due to concurrent
accesses to the same single-ported SPRAM bank, and stall cycles due to cache misses and the
miss penalty (the off-chip memory access to fetch the cache block). The software eCacti [45] is used
to obtain the power per cache read-hit, read-miss, write-hit and write-miss. The SPRAM
power per read access and power per write access are obtained from the semiconductor
vendor's ASIC memory library. The area for a given cache architecture is computed using
eCacti [45], and the area for SPRAM is obtained from the memory library.
Figure 7.2: Memory Exploration Framework
The exploration process is repeated for all valid memory architectures and the area,
power and performance are computed for each of these. The last step is to identify the list
of “optimal” architectures. Since this is a multi-objective problem, all the solution points
are evaluated according to the Pareto optimality conditions given by Equation 6.1 in
Section 6.2.3.3. According to this equation, if (M^a_pow, M^a_cyc, M^a_area) and
(M^b_pow, M^b_cyc, M^b_area) are the memory power, memory cycles and memory area for
memory architectures A and B respectively, then A dominates B if the following
expression is true:

((M^a_pow < M^b_pow) ∧ (M^a_cyc ≤ M^b_cyc) ∧ (M^a_area ≤ M^b_area))
∨ ((M^a_cyc < M^b_cyc) ∧ (M^a_pow ≤ M^b_pow) ∧ (M^a_area ≤ M^b_area))
∨ ((M^a_area < M^b_area) ∧ (M^a_cyc ≤ M^b_cyc) ∧ (M^a_pow ≤ M^b_pow))
From the set of solutions generated by the memory architecture exploration module, all the
dominated solutions are identified and removed. The non-dominated solutions form the
Pareto optimal set, which represents the set of good architectural solutions that provide
interesting design trade-off points from a power, performance and cost viewpoint.
7.3 Data Partitioning Heuristic
As the cache structure has associated tag overheads, SPRAM consumes much less area than
a cache on a per-bit basis [12]. Further, an SPRAM memory access consumes less power
than a memory access that is a cache hit [12]. While the data sections mapped to off-chip
memory share the cache space dynamically and in a transparent manner, SPRAM space
is assigned to data sections exclusively if dynamic data layout is not used. As a result, the
usage of SPRAM is costly from a system perspective, as it remains locked to specific data
after the data layout, unlike in caches, where the space is effectively reused through
dynamic mapping of data by hardware. Hence, SPRAM has to be carefully utilized and
the objective in a memory architecture exploration should be to minimize the SPRAM
size.
The objective of data partitioning is to identify data sections that must be placed in
SPRAM for best performance. We refer to a set of data (one or more scalar variables
or array variables) that are grouped together as one data-section. A data-section forms
an atomic unit that will be assigned a memory address. All data that are part of a data
section are placed in memory contiguously. An example of a data section is an array data
structure.
In order to identify data sections that should be mapped to SPRAM, our heuristic
uses different characteristics of the data section. These include the access frequency, the
temporal access pattern and the spatial locality pattern. These are explained below.
To model the temporal access pattern of different data sections, a temporal relationship
graph (TRG) representation has been proposed in [17]. A TRG is an undirected graph,
where nodes represent data sections and an edge between a pair of nodes indicates that
two successive references to either of the data sections are interleaved by a reference to
the other. The weight associated with an edge (a, b) represents the number of times such
interleaved accesses of a and b have occurred in the access pattern. We illustrate these
ideas with the help of an example.
Let there be 4 data-sections a, b, c and d and the access pattern of these data sections
in the application be:
a a a b c b c b c b c d d d d d a a a a a a a c a c a a c a c
Figure 7.3: Example: Temporal Relationship Graph
For this access pattern the TRG is shown in Figure 7.3. Given a trace of data memory
references, the weight associated with (a, b), denoted by TRG(a, b), is the number of times
that two successive occurrences of a are intervened by at least one reference to b, or vice
versa. As an example, for the pattern bcbcbcb, TRG(b, c) = 5. Note that a reference to
c intervenes successive references to b on three occasions and a reference to b intervenes
successive references to c twice, making TRG(b, c) = 5. For the given pattern,
TRG(b, d) = 0 as there are no interleaved accesses; hence no edge exists between b and
d. The TRG is computed for all the data sections from the address trace collected from an
instruction set simulator. We define STRG(i) as the sum of all TRG weights on the edges
connected to node i. As an example, from Figure 7.3, STRG(a) = 10.
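The TRG construction from an access trace can be sketched as follows. This is our rendering of the definition above, not the thesis implementation; it reproduces both the bcbcbcb example and STRG(a) = 10 for the trace of Figure 7.3.

```python
from collections import defaultdict

def build_trg(trace):
    """TRG(x, y): number of times two successive occurrences of x are
    intervened by at least one reference to y, or vice versa."""
    trg = defaultdict(int)
    last_seen = {}            # symbol -> index of its previous occurrence
    for i, x in enumerate(trace):
        if x in last_seen:
            # Distinct symbols appearing between the two occurrences of x.
            for y in set(trace[last_seen[x] + 1 : i]) - {x}:
                trg[frozenset((x, y))] += 1
        last_seen[x] = i
    return trg

def strg(trg, node):
    """Sum of TRG weights on all edges connected to `node`."""
    return sum(w for edge, w in trg.items() if node in edge)

print(build_trg(list("bcbcbcb"))[frozenset("bc")])         # 5

trace = list("aaabcbcbcbcdddddaaaaaaacacaacac")            # pattern from the text
print(strg(build_trg(trace), "a"))                         # 10
```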
Next, we define a term, the spatial locality factor, which gives a measure of the spatial
locality in the access trace for each data section. The spatial locality is influenced by the
stride in accessing different elements of the data section. The spatial locality factor is
computed by running the filtered access trace, which contains only the accesses pertaining
to that data section, through a cache with a single block and counting the number of
misses incurred. The spatial locality factor is the ratio of the number of such misses to the
size of the data section. For example, if the accesses to data section b in the filtered trace
bbbb correspond to cache blocks b1b2b1b1, where b1 and b2 correspond to different blocks
(determined by the cache block size), and the size of data section b is Sb cache blocks, then
the spatial locality factor is 3/Sb.
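The single-block-cache computation can be sketched as follows; the byte addresses in the example are our own, chosen so that the trace touches blocks in the pattern b1 b2 b1 b1 as in the text.

```python
def spatial_locality_factor(addresses, block_size, section_size):
    """SLF = misses on a one-block cache, driven by the filtered trace
    of one data section, divided by the section size in cache blocks."""
    misses, resident = 0, None
    for addr in addresses:
        block = addr // block_size
        if block != resident:          # one-block cache: any block change is a miss
            misses += 1
            resident = block
    blocks_in_section = -(-section_size // block_size)   # ceiling division
    return misses / blocks_in_section

# Trace bbbb touching blocks b1 b2 b1 b1 (32-byte blocks, 64-byte section).
print(spatial_locality_factor([0, 40, 4, 8], 32, 64))    # 3 misses / 2 blocks = 1.5
```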
There are three parameters that control the decision to keep a data section in an
on-chip SPRAM.
1. Access Frequency (Af) : Placing the most frequently accessed data section in
SPRAM gives better power consumption and better run-time performance.
2. Temporal Access Characteristics : A data section is said to be conflicting if it gets
accessed along with many other data sections. Placing the most conflicting data
section in SPRAM reduces the number of cache conflict misses and hence improves
the overall memory subsystem performance. This parameter is computed from the
TRG. The STRG factor is a direct indication of the extent to which a data section’s
life-time overlaps with other data sections.
3. Spatial Locality Factor (SLF): Data sections that have poor spatial locality (i.e., a
higher spatial locality factor) use more cache lines simultaneously and thereby reduce
the available cache space for other data. Also, such data exhibit less spatial reuse,
causing more cache misses, which in turn increases the power consumption due to
off-chip memory accesses. Hence, it is both power and performance efficient to place
a data section that has a higher spatial locality factor in SPRAM.

Table 7.1: Input Parameters for Data Partitioning Algorithm

Notation    Description
N           Number of data sections
TRG(a, b)   Temporal access pattern between nodes a and b
STRG(a)     Sum of TRG weights on all edges connected to node a
AF(a)       Access frequency of data section a
SLF(a)      Spatial locality factor of data section a
Thus, a frequently accessed data section that conflicts most with the rest of the data and
also exhibits the least spatial locality is an ideal candidate to be placed in SPRAM, as this
gives the best performance from an overall memory subsystem perspective. For each of the data
sections, a conflict index is computed using the three parameters mentioned above. The
conflict index of a node corresponding to data section s is computed as follows.
nSTRG(s) = STRG(s) / (Σ_{i=1}^{N} STRG(i))     (7.1)

nAF(s) = AF(s) / (Σ_{i=1}^{N} AF(i))     (7.2)

nSLF(s) = SLF(s) / (Σ_{i=1}^{N} SLF(i))     (7.3)

CI(s) = nSTRG(s) + nAF(s) + nSLF(s)     (7.4)

In the above equations, SLF(s) and AF(s) correspond to the spatial locality factor
and access frequency of s respectively. The terms on the LHS of Equations 7.1, 7.2 and
7.3 are normalized factors. The higher the conflict index, the more suitable the data section is
for SPRAM placement. Our data partitioning heuristic algorithm is explained in Figure
7.4. The greedy heuristic sorts the data sections based on the conflict index and assigns
the data section that has the highest conflict index to SPRAM. The corresponding node
is removed from the TRG and the conflict index for the remaining data sections is
recomputed. Note that the above step is performed for every data section identified to be
placed in SPRAM. This process is repeated either until the SPRAM space is full or until
there are no more data sections to be placed.
7.4 Cache Conscious Data Layout
7.4.1 Overview
The data partitioning step places the most conflicting data in SPRAM and thereby reduces
the possible conflict misses in the cache. However, the SPRAM size is typically very small
and only a few data sections would have been placed in SPRAM³. The remaining data
sections still need to be placed carefully in the cache to reduce the cache misses. In this
section we discuss the cache-conscious data layout.
The problem of cache-conscious data layout is to find optimal data placement in off-
chip RAM with the following objectives: (a) to reduce the number of cache misses and (b)
to reduce the address space used in off-chip RAM. In other words, the objective is to reduce
the "holes" in off-chip RAM after placement. By this, we mean that the data sections are
placed in the off-chip RAM in such a manner that the gaps between data sections, which
are left to reduce conflict misses, are minimized. These gaps lead to wasted memory space and
hence increase hardware cost. To the best of our knowledge, reducing cache misses (the
first objective) has been the sole objective targeted by all earlier data layout approaches
published [17, 22, 41, 50, 53]. But it is very important to consider objective (a) in the
context of objective (b) for the following reasons.
³As mentioned earlier, data placement within the SPRAM can be done in a subsequent phase using any of the data layout methods discussed in Chapter 3. We do not experiment with this as it has been extensively dealt with in the previous chapters.
Algorithm: SPRAM-Cache Data Partitioning

Inputs:
    N = number of data sections
    Access frequency of all data sections
    Temporal Relationship Graph (TRG)
    Spatial Locality Factor (SLF)
    Data section sizes
Output:
    List of data sections to be placed in SPRAM

begin
1.  Compute the access frequency per byte for all data sections
2.  Normalize the access frequency per byte for all data sections (nAF)
3.  for i = 0 to N−1
    3.1  compute STRG(i); sumSTRG += STRG(i);
4.  for i = 0 to N−1
    4.1  nSTRG(i) = STRG(i) / sumSTRG;
5.  for i = 0 to N−1
    5.1  compute SLF(i); sumSLF += SLF(i);
6.  for i = 0 to N−1
    6.1  nSLF(i) = SLF(i) / sumSLF;
7.  for i = 0 to N−1
    7.1  conflict-index(i) = nSTRG(i) + nAF(i) + nSLF(i);
8.  sort the data sections in descending order of conflict-index
9.  while (space is available in SPRAM and data sections remain)
    9.1  identify the data section s with the highest conflict index
    9.2  place s in SPRAM if it fits within the available space
    9.3  update the available SPRAM space to account for the above placement
    9.4  remove s from the TRG
    9.5  recompute STRG for the remaining nodes in the TRG
    9.6  recompute the conflict index with the newly updated STRG
10. exit
end
Figure 7.4: Heuristic Algorithm for Data Partitioning
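The heuristic of Figure 7.4 can be rendered as compact executable code. This is our sketch; details such as dropping a section from further consideration when it does not fit, and recomputing the conflict index from the pruned TRG on every iteration, are our reading of the algorithm.

```python
def partition_data(sections, sizes, af, slf, trg, spram_size):
    """Greedy SPRAM/cache partitioning in the spirit of Figure 7.4.
    sections: list of names; sizes/af/slf: per-section dicts;
    trg: dict mapping frozenset({x, y}) -> weight.
    Returns the list of sections chosen for SPRAM."""
    trg = dict(trg)                      # local copy; edges are pruned below
    remaining = set(sections)
    free = spram_size
    chosen = []

    def conflict_index(s):
        strg_all = {d: sum(w for e, w in trg.items() if d in e) for d in remaining}
        tot_strg = sum(strg_all.values()) or 1
        tot_af = sum(af[d] / sizes[d] for d in remaining) or 1      # per-byte AF
        tot_slf = sum(slf[d] for d in remaining) or 1
        return (strg_all[s] / tot_strg
                + (af[s] / sizes[s]) / tot_af
                + slf[s] / tot_slf)

    while free > 0 and remaining:
        s = max(remaining, key=conflict_index)
        if sizes[s] <= free:             # step 9.2: place only if it fits
            chosen.append(s)
            free -= sizes[s]
        remaining.remove(s)
        trg = {e: w for e, w in trg.items() if s not in e}   # step 9.4
    return chosen

# Example with the TRG weights of Figure 7.3 and made-up AF/SLF values.
secs = ["a", "b", "c", "d"]
sz = {s: 100 for s in secs}
af = {"a": 100, "b": 10, "c": 10, "d": 1}
slf = {"a": 4, "b": 1, "c": 1, "d": 1}
trg = {frozenset("ac"): 8, frozenset("ab"): 1, frozenset("ad"): 1,
       frozenset("bc"): 6, frozenset("cd"): 1}
print(partition_data(secs, sz, af, slf, trg, spram_size=100))   # ['a']
```

Section a dominates all three normalized factors, so it fills the 100-byte SPRAM by itself.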
• For SOC architectures with instruction cache and data cache that share the same
off-chip RAM, a data layout approach that optimizes only the data cache misses,
without considering optimization of the off-chip RAM address space, will use up too
much address space by spreading the data placement, leaving many holes. This will
place severe constraints on code placement, requiring the code to be placed across the
holes and in the remaining off-chip RAM. This may potentially result in additional
instruction cache misses. Hence, there is a chance that all the gains achieved by
optimizing the data cache misses are lost.
• A data layout approach which optimizes the data placement in off-chip RAM with-
out any holes will be independent of instruction cache placement. Hence, the ar-
chitecture exploration of data cache can be done independent of instruction cache.
For example, an application with 96K of data will have around 2700 hybrid archi-
tectures that are worth exploring. If the code placement is not independent of the
data layout and the code segments are placed in the holes created, then the memory
exploration process needs to consider both instruction and data cache configuration
together. This will increase the number of architectures considered. In such a sce-
nario, the number of architectures explored could increase to 50000+. Hence, it
is important to design a data layout algorithm that is independent of instruction
cache.
We formulate the cache conscious data layout problem as a graph partitioning prob-
lem [38]. Inputs to the data layout algorithm are (i) application’s data section sizes
and (ii) Temporal Relationship Graph. The data layout algorithm is explained in a
block diagram in Figure 7.5. The first step in the data layout problem is modelled
as a graph partitioning problem, where data sections are grouped into disjoint subsets
such that the memory requirement for the data sections in a disjoint subset is less than
the cache size. More specifically, the first step is a k-way graph partitioning, where
k = ⌈application data size / cache size⌉. The data sections in each of the partitions are
selected such that they have intervening accesses and hence can cause potential conflict
misses. Thus the output of the graph partitioning step is k partitions, with each partition
Figure 7.5: Cache Conscious Data Layout
having a set of data sections that conflict among themselves the most, with the partition
size less than the cache size. Since each of the k partitions is smaller than the cache size,
each of these partitions can be mapped into an off-chip RAM address space that corresponds
to one cache page. This step eliminates all the conflicts between data sections that are in
the same partition. The graph partitioning method is discussed in detail in Section 7.4.2.
The next step in the data layout is to minimize the possible conflicts between data-
sections that are in two different partitions. This is handled by the offset-computation
step. The details of the offset computation are presented in Section 7.4.3. Once the offset-
computation step assigns cache-block offsets to each of the data sections, the address
assignment step allocates unique off-chip addresses to all the data sections. Finally, using
the address assignment, the number of cache misses and the power consumed for cache
and off-chip memory accesses are computed, which are used for identifying the Pareto-optimal
solutions. The following subsections detail the graph partitioning heuristic and the offset
computation heuristic.
7.4.2 Graph Partitioning Formulation
In this section we explain the graph partitioning heuristic, which is a generalisation of
Kernighan-Lin [38] and operates on the temporal relationship graph for the data sections
that need to be placed in off-chip RAM. Note that this excludes all data sections that
have been mapped to SPRAM. The temporal relationship graph is G = {V, E, s, w},
where V is the set of vertices representing data sections and E is the set of edges between
pairs of data sections representing temporal access conflicts. Further, the functions s and
w are associated respectively with the nodes and edges of the TRG; s(u) represents the
size of the data section associated with a node u and w(u, v) represents the number of
temporal access conflicts between a pair of nodes u and v. The weight function w(u, v)
is the same as TRG(u, v), but restricted to the data sections that need to be assigned
to the off-chip RAM. The graph partitioning problem aims at dividing G into m disjoint
partitions. An m-way partition of G is a collection of subgraphs Gi = {Vi, Ei}, such that

• the subsets are disjoint: Vi ∩ Vj = ∅, for i ≠ j
• ⋃_{i=1}^{m} Vi = V
• every edge e = (u, v) ∈ E belongs to Ei iff u ∈ Vi and v ∈ Vi
The objective of the graph partitioning step is to group the nodes such that the sum
of the weights on the internal edges is maximized. The objective function that needs to be
maximized is given in Equation 7.5, with the constraint given in Equation 7.6.

Σ_i Σ_{e_j ∈ E_i} w(e_j)     (7.5)

Σ_{u_j ∈ V_i} s(u_j) ≤ cache-size     (7.6)
An edge e_ext = (u, v) is said to be an external edge for a partition G_i if u ∈ V_i and
v ∉ V_i; i.e., if one of the nodes connected by the edge is in partition G_i and the other is not.
Similarly, an edge e_int is said to be an internal edge if both the nodes it connects are
in the partition G_i. The sum of all the weights on the external edges of partition G_i
is referred to as the external cost, E_i = Σ_{e_ext ∈ G_i} w(e_ext). The sum of all the
weights on the internal edges of partition G_i is referred to as the internal cost,
I_i = Σ_{e_int ∈ G_i} w(e_int). The total external cost is E = Σ_i Σ_{e_ext ∈ G_i} w(e_ext).
Thus the objective of the partitioning problem is to find a partition with minimum external
cost. Alternatively, the graph partitioning problem can also be formulated as maximizing
the total internal cost, i.e., Σ_i Σ_{e_int ∈ G_i} w(e_int), subject to the constraint
Σ_{u_j ∈ G_i} s(u_j) ≤ cache-size for all G_i.
The optimal partitioning problem is NP-complete [38, 66]. There are a number of
heuristic approaches [26, 47] to this problem, including the well-known Kernighan-Lin
heuristic [38] for two partitions. We extend the heuristic proposed in [38, 66] to solve our
problem. The Kernighan-Lin heuristic aims at finding a minimal external cost partition
of a graph into two equally sized subgraphs. The heuristic achieves this by starting with
a random partition and repeatedly swapping the pair of nodes that gives the maximum
gain. The gain is computed as the difference between internal and external costs. Let us
consider two
nodes a and b present in two different subgraphs A and B respectively. We define the
external cost (ECost) of a as E_a = Σ_{x ∈ B} w(a, x) and the internal cost (ICost) of a
as I_a = Σ_{y ∈ A} w(a, y), for each a ∈ A. Similarly, the ECost and ICost of b are defined
as E_b and I_b respectively. Let D_a = E_a − I_a be the difference between ECost and ICost
for each a ∈ A. A result proved by Kernighan and Lin [38] shows that for any a ∈ A and
b ∈ B, if they are interchanged, the reduction in partitioning cost is given by
R_ab = D_a + D_b − 2·w(a, b). The nodes a and b are interchanged to partitions B and A
respectively if R_ab > 0.
In [66], the graph partitioning heuristic is generalized to an m-way partition. It starts
with a random set of m partitions and picks any two of the partitions and applies the
Kernighan-Lin heuristic repeatedly on this pair until no more profitable exchanges are
possible. Then these two partitions are marked as pair-wise optimal. The algorithm then
picks two other partitions to apply the heuristic. This process is repeated until all the
partitions are pair-wise optimal.
We have adapted the algorithm of [66] and added additional constraints to make it
work for our problem. The main constraints are as below:

1. Σ_{a ∈ V_i} s(a) ≤ cache-size for all partitions;
2. if a data-section size s(a) > cache-size, then this data section is placed alone in a
partition and marked optimal; and
3. nodes a and b are interchanged to partitions B and A respectively only if R_ab > 0
and if Σ_{a ∈ A} s(a) ≤ cache-size and Σ_{b ∈ B} s(b) ≤ cache-size.
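One pairwise improvement pass with these constraints can be sketched as follows. This is our reconstruction, and checking the partition sizes after the prospective swap is our interpretation of constraint 3:

```python
def improve_pair(A, B, w, size, cache_size):
    """Repeatedly swap the best (a, b) pair between partitions A and B
    (sets of nodes) while the gain R_ab = D_a + D_b - 2*w(a, b) is
    positive and both partitions stay within the cache size."""
    def D(n, own, other):
        ext = sum(w(n, x) for x in other)              # E_n
        internal = sum(w(n, y) for y in own if y != n) # I_n
        return ext - internal

    improved = True
    while improved:
        improved = False
        best = None
        for a in A:
            for b in B:
                gain = D(a, A, B) + D(b, B, A) - 2 * w(a, b)
                size_ok = (sum(size[x] for x in A) - size[a] + size[b] <= cache_size
                           and sum(size[x] for x in B) - size[b] + size[a] <= cache_size)
                if gain > 0 and size_ok and (best is None or gain > best[0]):
                    best = (gain, a, b)
        if best:
            _, a, b = best
            A.remove(a); B.remove(b)
            A.add(b); B.add(a)
            improved = True
    return A, B

# Toy graph: heavy edges 1-2 and 3-4; starting from {1,3}/{2,4} the
# pass regroups the heavily conflicting pairs into the same partition.
W = {frozenset((1, 2)): 10, frozenset((3, 4)): 10, frozenset((1, 3)): 1}
wf = lambda x, y: W.get(frozenset((x, y)), 0)
A, B = improve_pair({1, 3}, {2, 4}, wf, {1: 1, 2: 1, 3: 1, 4: 1}, cache_size=10)
# partitions become {1, 2} and {3, 4} (in either order)
```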
The output of the graph partitioning step is a collection of subgraphs that maximizes
the internal cost, minimizes the external cost and ensures that no partition has a size
larger than the cache size⁴. Thus, each of the partitions can be placed in an off-chip RAM
address space that maps to a cache page, such that none of the data sections that are part
of the same partition will conflict in the cache. We are now left with optimizing the cache
conflicts that might arise between data sections belonging to two different partitions.
Since the external cost is already minimized, the number of such conflicts will already be
very small. The offset computation step, described in the following subsection, aims at
reducing conflicts caused by data sections belonging to different partitions.
7.4.3 Cache Offset Computation
The cache offset computation step aims at reducing cache conflict misses between data
sections that are part of two different partitions. Each partition is placed in the off-
chip RAM address space that corresponds to one cache page. It may be noted that the
ordering of the partitions does not have any impact on the cache misses. For each of the
data sections in a partition, a cache-block offset needs to be assigned which in turn is
used to determine a unique off-chip memory address for the data section.
⁴Obviously, a partition containing a data section whose size is larger than the cache size will not obey this property. But such a data section can be considered to form l = ⌈data section size / cache size⌉ consecutive partitions, each less than or equal to the cache size.
Algorithm: Offset Computation Heuristic

Inputs:
    TRGblk values for all the data blocks
    External costs for all the partitions (E_i)
    Internal costs for all the partitions (I_i)
    External costs E_i,uj for each node uj in a partition Gi
    Cache configuration
    Data section sizes
Output:
    Offsets assigned to each of the data sections

begin
1.  Sort the partitions in decreasing order of external cost
2.  for i = 1 to k partitions
    2.1  pick the partition Gi with the highest external cost
    2.2  sort the data sections in descending order of the external cost E_i,uj
    2.3  for all data sections in Gi
         2.3.1  pick the data section uj with the highest E_i,uj
         2.3.2  evaluate the placement cost of placing uj in each of the
                available cache lines for the target cache configuration:
                2.3.2.1  place uj in an available cache line, with the constraint
                         that the data section must be placed contiguously
                2.3.2.2  compute the cost of placement using the TRGblk
                         information for all the data blocks already placed
                2.3.2.3  store the cost of placement C_l for the cache line l
                2.3.2.4  repeat the last three steps for all possible cache lines
         2.3.3  find the cache line l that gives the minimal cost
         2.3.4  assign l as the starting point for uj
         2.3.5  mark the cache lines from l to l + size(uj)/block-size as
                not available for other data sections in Gi
    2.4  end for
3.  end for
4.  placement complete
end
Figure 7.6: Heuristic Algorithm for Offset Computation
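The inner cost-evaluation loop of the offset computation can be sketched as follows. The simplified cost model (summing TRGblk weights between a block and the blocks already occupying its cache line) and all names are ours, not the thesis code:

```python
def placement_cost(section_blocks, start_line, occupied, trg_blk, n_lines):
    """Cost of placing a section whose block symbols are `section_blocks`
    starting at cache line `start_line`: the sum of TRGblk weights
    between each block and the already-placed blocks that map to the
    same cache line (modulo the number of lines)."""
    cost = 0
    for i, blk in enumerate(section_blocks):
        line = (start_line + i) % n_lines
        for other in occupied.get(line, []):
            cost += trg_blk.get(frozenset((blk, other)), 0)
    return cost

def best_offset(section_blocks, free_lines, occupied, trg_blk, n_lines):
    """Try every allowable starting line (enough contiguous free lines)
    and return the one with minimal placement cost."""
    need = len(section_blocks)
    candidates = [l for l in sorted(free_lines)
                  if all(((l + i) % n_lines) in free_lines for i in range(need))]
    return min(candidates,
               key=lambda l: placement_cost(section_blocks, l, occupied,
                                            trg_blk, n_lines))

# a1 was already placed at line 1; TRGblk(a1, b0) = 5, so b avoids line 1.
trg_blk = {frozenset(("a1", "b0")): 5}
print(best_offset(["b0", "b1"], {0, 1, 2, 3}, {1: ["a1"]}, trg_blk, 4))   # 0
```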
To decide the offset that gives the least number of conflicts, we compute the placement
cost for all possible placements of the data section inside a cache page. To compute the
placement cost, we use a fine-grained version of the TRG. Note that the TRG computed in
Section 7.3 is at the granularity of data sections. But to determine at which offset to place
a data section, the temporal access pattern needs to be computed at a finer granularity.
We illustrate these ideas with the help of an example.
Let there be 2 data sections a and b of size 128 bytes and 64 bytes respectively.
Consider the following access pattern: a[0]b[0]a[60]b[1]a[61]b[2]a[62]b[3]
For this access pattern, TRG(a, b) is 6, as explained in Section 7.3; i.e., data sections
a and b are accessed 6 times in an interleaved manner. However, for a direct-mapped
cache of size 4KB with a 32-byte block size, placing a at address k and b at off-chip
address k + 4KB will not result in any conflict misses, even though TRG(a, b) = 6. This
is because a[60], a[61] and a[62] map to a cache line (C + 1), while a[0], b[0], b[1], b[2]
and b[3] map to cache line C. However, if a is placed at address k and b is placed at
address k + 4KB + 32B, then the pattern results in 5 conflict misses.
Hence, to determine the cost, in terms of conflict misses, of placing a data section, the
TRG values are needed at a finer granularity. For the above example, if we keep the
granularity at 1 cache block, then data section a is divided into 4 data blocks and data
section b is divided into 2 data blocks. We define a new term, TRGblk, that represents
the temporal access pattern among data blocks. This is similar to the approach described
in [17]. The above access sequence results in a0, b0, a1, b0, a1, b0, a1, b0, where a0 and
a1 represent the first two (cache-block-sized) blocks of data section a, and b0 represents
the first block of data section b. For the above example, TRGblk will consist of the nodes
a0, a1, a2, a3, b0 and b1. For the access pattern given above, TRGblk(a1, b0) = 5 and all
other TRGblk values are 0. We use the TRGblk values to compute the cost of placement,
C(s, l), for a data section s at a cache offset l.
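TRGblk is the same interleaving count as the TRG of Section 7.3, applied to block-level symbols; a sketch reproducing the example above (the (section, byte-index) encoding of accesses is our own):

```python
from collections import defaultdict

def block_trace(accesses, block_size):
    """Map (section, byte_index) accesses to block-level symbols,
    e.g. ('a', 60) -> 'a1' for 32-byte blocks."""
    return [f"{sec}{idx // block_size}" for sec, idx in accesses]

def trg_of(trace):
    """Interleaving count (as in Section 7.3), here at block granularity."""
    trg, last = defaultdict(int), {}
    for i, x in enumerate(trace):
        if x in last:
            for y in set(trace[last[x] + 1 : i]) - {x}:
                trg[frozenset((x, y))] += 1
        last[x] = i
    return trg

acc = [("a", 0), ("b", 0), ("a", 60), ("b", 1),
       ("a", 61), ("b", 2), ("a", 62), ("b", 3)]
bt = block_trace(acc, 32)
print(bt)                                    # ['a0', 'b0', 'a1', 'b0', 'a1', 'b0', 'a1', 'b0']
print(trg_of(bt)[frozenset(("a1", "b0"))])   # 5
```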
The offset computation algorithm is shown in Figure 7.6. To begin with, the
partitions are ordered based on their total external cost (E_i). The partition Gi with the
highest external cost is selected first for offset computation. Data sections that are part
of partition Gi are ordered by the external cost of the corresponding nodes in Gi.
The data section uj with the highest external cost (Ei,uj) is taken up first for offset
computation. Data section uj is tentatively placed at each of the allowable cache lines
and the placement cost is computed with the help of TRGblk. Here, by allowable, we
mean that there are enough contiguous free cache lines in a cache page to accommodate
the data section uj. For example, if the data section size is 128 bytes and the cache block
size is 32 bytes, then a feasible cache line means 4 contiguous lines are free. Note that
at this point no offset is assigned to the data section uj. The cost of placement C(uj, l)
for data section uj is computed for every allowable cache line l from 1 to Nl, where Nl
is the total number of cache lines. The cache line l with the minimum cost is assigned
to data section uj, and the cache lines from l to l + size(uj)/line-size - 1 are marked
as full so that they are not available to any other data section in Gi. This restriction
ensures that the cache offsets for all data sections in a partition Gi are assigned within
one cache page, which in turn keeps the amount of external address space used close to
the application data size. The above process is repeated for all data sections in partition
Gi. Then the next partition Gi+1 with the highest external cost is selected for offset
computation. This process continues until all partitions are handled.
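The greedy procedure just described can be sketched as follows. The function names, the data-structure shapes, and the placement-cost callback are illustrative assumptions; the algorithm in Figure 7.6 is the authoritative version.

```python
def assign_offsets(partitions, ext_cost, section_cost, size_in_lines, num_lines):
    """Greedy cache-offset assignment, sketched from the text above.

    partitions:    list of partitions Gi, each a list of data-section names
    ext_cost:      dict mapping a section to its external cost E_{i,u}
    section_cost:  callback (section, line, placed) -> placement cost C(s, l),
                   assumed to be derived from the TRGblk weights
    size_in_lines: dict mapping a section to the contiguous cache lines it needs
    num_lines:     total number of cache lines in a cache page (Nl)
    Returns a dict mapping each section to its assigned starting cache line.
    """
    offsets = {}
    # Process partitions in decreasing order of total external cost Ei.
    for part in sorted(partitions,
                       key=lambda p: sum(ext_cost[s] for s in p), reverse=True):
        free = [True] * num_lines  # line availability within this partition's page
        # Process sections in decreasing order of external cost E_{i,u}.
        for sec in sorted(part, key=lambda s: ext_cost[s], reverse=True):
            need = size_in_lines[sec]
            best_line, best_cost = None, None
            for line in range(num_lines - need + 1):
                # "Allowable" placement: enough contiguous free lines.
                if all(free[line:line + need]):
                    c = section_cost(sec, line, offsets)
                    if best_cost is None or c < best_cost:
                        best_line, best_cost = line, c
            if best_line is None:
                raise ValueError(f"no allowable cache lines for section {sec}")
            offsets[sec] = best_line
            # Mark the occupied lines full for the rest of partition Gi.
            for l in range(best_line, best_line + need):
                free[l] = False
    return offsets

# Toy usage: one partition with two sections; the cost callback simply
# prefers lower line numbers, standing in for a TRGblk-derived cost.
offsets = assign_offsets(
    partitions=[["u", "v"]],
    ext_cost={"u": 5, "v": 3},
    section_cost=lambda s, l, placed: l,
    size_in_lines={"u": 2, "v": 1},
    num_lines=4,
)
```

Here section u (higher external cost) is placed first and occupies two lines, so v is pushed to the next free line; both stay within the same cache page, matching the restriction described above.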
7.5 Experimental Methodology and Results
7.5.1 Experimental Methodology
We have used Texas Instruments' TMS320C64x processor for our experiments. This pro-
cessor has a 16KB data cache, and we have used Texas Instruments' Code Composer Studio
(CCS) environment for obtaining profile data and data memory address traces, and for
validating data-layout placements. We have used 3 different applications - AAC (Advanced
Audio Coding), MPEG video encoder and JPEG image compression from MediaBench
[43] - for performing the experiments. We compute the TRG, sumtrg, and the spatial locality
factor from the data memory address traces obtained from CCS. We used eCACTI
[45] to obtain the area and power numbers for different cache configurations. First, we
report experimental results demonstrating the benefits of our cache-conscious data layout
method. Subsequently, in Section 7.5.4, we report the results pertaining to cache-SPRAM
memory architecture exploration.
7.5.2 Cache-Conscious Data Layout
In this section we present results of our cache-conscious data layout and compare them
with the approach proposed by Calder [17]. We have used the above 3 MediaBench
applications and 4 different cache sizes. In this experiment, for all the cache sizes we
have used a 32-byte cache-block size and a direct-mapped cache configuration. Table 7.2
presents the results of the data layout. Column 4 in Table 7.2 gives the number of
cache misses incurred when the data-layout approach of [17] is used, and Column 5
gives the number of cache misses incurred when our data-layout approach is applied. Our
approach performs consistently better and reduces the number of cache misses, especially
for AAC and MPEG. Our method achieves up to a 34% reduction in cache misses (for
AAC with a 16KB cache). Further, our approach consumes an off-chip memory address
space that is very close to the application data size; this follows by construction from the
graph-partitioning approach and from avoiding gaps during data layout, as explained in
Section 7.4. In contrast, Calder's approach [17] consumes 1.5 to 2.6 times the application
data size in the off-chip address space to achieve the performance given in Table 7.2. This
is a significant advantage of our approach, as increased off-chip address space implies
increased memory cost for the SoC.
In Table 7.3, we present the results of our approach for different cache configurations
(direct-mapped, 2-way and 4-way set-associative caches). Note that these experiments
are performed with a cache-only architecture and no SPRAM. Observe that for all the
applications, the reduction in misses is significant for 2-way and 4-way set-associative
caches. However, for the 4KB cache configuration for MPEG, the reduction in cache
misses is modest. This is due to the large data-set (footprint) requirement of MPEG.
Also, observe that the data set for JPEG is much smaller, and hence a direct-mapped 16KB
cache or a 4-way set-associative 8KB cache could resolve most of the conflict misses.
Table 7.2: Data Layout Comparison
Application | Cache Size | Number of memory accesses | Cache misses: Calder [17] | Cache misses: Graph-partition | Improvement