
Enhanced Synchronous Design Using

Asynchronous Techniques

by

Navid Toosizadeh

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering
University of Toronto

Copyright © 2010 by Navid Toosizadeh

Abstract

Enhanced Synchronous Design Using Asynchronous Techniques

Navid Toosizadeh

Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering

University of Toronto

2010

As semiconductor technology scales down, process variations become increasingly difficult to control. To cope with this, more and more conservative delay and clock frequency estimations are used during design, which result in overly large and leaky circuits. Also, the system runs slower than it could because it uses a fixed clock determined by worst-case analysis of the circuit. On top of process variations, voltage and temperature variations also push the designer towards even more conservative delay estimations.

On the other hand, the asynchronous design style has potential advantages over synchronous design, including resilience to process variations, lower power consumption and higher performance. Unfortunately, these advantages are usually hindered by the significant design effort required to implement useful asynchronous circuits and by the overhead of asynchronous control logic.

Borrowing from asynchronous techniques, a new methodology is proposed to design synchronous circuits that have some of the advantages of asynchronous circuits. Asynchronous logic is used to generate the clock of a synchronous system. The resulting system automatically tunes itself to deliver the best-possible performance under the prevailing process-voltage-temperature (PVT) conditions. This methodology may be used to reduce the leakage power significantly in deep nanometer technologies. It also helps in handling process variations. The results from a 32-bit processor implemented in 90nm technology show a 10X leakage reduction compared to the traditional synchronous design.

The proposed technique is expanded to adjust the speed of a pipeline according to the current operations flowing in the pipeline as well as the current PVT conditions. The results from a 32-bit processor in 90nm technology demonstrate a 2X speed improvement compared to the conventional synchronous design. The proposed techniques only use synchronous design tools and are compatible with design flows that are currently in use.


Acknowledgements

I would like to thank my thesis supervisor, Professor Safwat Zaky, for his guidance and patience over the years. His extensive and broad knowledge has given me a solid foundation to depend on, while his dedication to the scientific process has improved the quality of my work. He has gone beyond the call of duty to help me become a better researcher. On a personal level, I have learned a great deal from him. I consider him to be my mentor.

To Professor Jianwen Zhu, I extend my gratitude for the guidance he offered during the course of my studies. As a member of my committee, he has always provided practical advice and shared experience with me. I am grateful for his contribution to my research.

I would like to thank Professor Zvonko Vranesic for his contribution as a member of my committee. His feedback was invaluable in improving the quality of my research.

I also take this opportunity to thank Professor Roman Genov. His valuable feedback certainly improved the quality of my thesis.

I would like to acknowledge the financial support provided by the Natural Sciences and Engineering Research Council of Canada, the University of Toronto and the Government of Ontario.

To my many friends from SF 2206 and other graduate offices, I extend my gratitude for their friendship and support during the course of my Ph.D. Many thanks to Kamran and Sogol.

To my brothers and sisters, I extend my heartfelt gratitude for their emotional support while I have been away from my home country. Mahnaz, Saeed, Nima and Sharareh, I have always felt your support. Many thanks to my family members Mohsen, Mojgan, Alireza, Keihan, Hooman, Sadaf, Kiarash, Mandana, Hamid and my parents-in-law Mehdi and Sareh.

My parents, Hossein and Masoumeh, have always encouraged me and provided the best support for me. I am always grateful to them for their unconditional love.

Above all, though, I am eternally grateful to my best friend and wife, Reihaneh, who has so greatly changed the course of my life. She has patiently supported my many busy evenings and weekends while I completed my academic work. I hope I can provide her with the love and happiness she has unselfishly given me.


Contents

List of Tables
List of Figures

1 Introduction
1.1 Thesis Motivation
1.2 Research Flow
1.3 Thesis Contributions
1.4 Thesis Organization

2 Background
2.1 Introduction
2.2 Challenges of Today’s Technologies
2.3 Asynchronous Circuits
2.3.1 History of asynchronous systems
2.4 Asynchronous Styles and Building Blocks
2.4.1 Asynchronous handshake protocols
2.4.2 C-element
2.4.3 Classes of asynchronous circuits
2.5 Asynchronous Pipelines
2.5.1 Micropipelines
2.5.2 MOUSETRAP
2.6 Globally Asynchronous Locally Synchronous Systems
2.7 Potential Asynchronous Design Advantages
2.7.1 Avoiding clock skew
2.7.2 Lower electromagnetic noise
2.7.3 Lower power consumption
2.7.4 Resilience to process and environmental variations
2.7.5 Modularity
2.7.6 Higher performance
2.8 Asynchronous Design Applications
2.9 Difficulties in Asynchronous Design
2.10 Attacking Asynchronous Challenges
2.10.1 Desynchronization
2.10.2 Methods using conventional HDLs
2.10.3 Methods using communicating sequential processes languages
2.11 Concluding Remarks

3 Application of Concurrency in Asynchronous Design
3.1 Introduction
3.2 Background
3.2.1 Handshake circuits
3.2.2 Sequencers
3.3 WAR Hazards
3.4 Using Edge-Triggering in Accumulator Circuits
3.5 Introducing Concurrency
3.6 System Timing Optimization
3.6.1 System examples
3.7 Experimental Methodology and Results
3.8 Conclusion

4 Enhanced Synchronous Design
4.1 Introduction
4.2 PVT-aware Design Approach
4.3 Solving Real-world Problems

5 Leakage Reduction
5.1 Introduction
5.2 Review of Power Management Techniques
5.3 Proposed Idea
5.4 The PVT-aware Architecture
5.5 Design Flow
5.5.1 Clock generation issues
5.6 Case Study: PVT-aware DLX Microprocessor
5.6.1 Tuning delays
5.6.2 Implementing the fixed-clock counterpart
5.7 Evaluation
5.7.1 Power and performance analysis
5.7.2 Resilience to inter-chip PVT variations
5.7.3 Resilience to intra-chip PVT variations
5.7.4 Suitability for voltage scaling
5.8 Discussion
5.8.1 Design space
5.8.2 Clock error detection
5.8.3 Expanding the PVT-aware approach
5.9 Comparison to Previous Work
5.10 Conclusion

6 VariPipe: Variable-clock Synchronous Pipelines
6.1 Introduction
6.2 VariPipe: The Idea
6.3 Design Methodology
6.3.1 Creating delay profiles
6.3.2 Simplifying delay profiles
6.3.3 Implementing the clock generation circuit
6.3.4 Variable delay implementation
6.4 Timing Constraints
6.5 Design Flow
6.6 Case Study: VariPipe DLX Microprocessor
6.6.1 Implementing the VariPipe DLX processor
6.7 Evaluation
6.7.1 Performance analysis
6.7.2 Energy consumption analysis
6.7.3 Area and energy overhead
6.7.4 Resilience to PVT variations
6.7.5 Reduction in electromagnetic noise
6.7.6 Suitability for voltage scaling
6.8 Related Work
6.9 Conclusion

7 Conclusion and Future Work
7.1 Conclusion
7.1.1 Summary and Contributions
7.2 Future Work

A Previous Publications
B Balsa Code for Radix-4 Booth Multiplier
C Chip Layout of the PVT-aware Processor

Bibliography

List of Tables

3.1 Overlapped delays for different insertion points
3.2 Delay values in ps for the accumulators inside the multiplier
3.3 Simulation results
3.4 Area of storage and control elements
5.1 PVT corners
5.2 Post-synthesis power and area breakdown
5.3 Toolset
5.4 Comparison of the clock period and the critical path
5.5 Benchmarks
5.6 Power and performance results under typical PVT, temp = 25°C
5.7 Post-layout area and leakage breakdown under typical PVT corner, temp = 25°C
5.8 Power and performance results under worst-case PVT, temp = 125°C
5.9 Clock period changes with intra-chip variations
6.1 Toolset
6.2 Operation Selection Table of Execution Unit
6.3 Operation Selection Table of Decoder
6.4 Post-layout delay profiles of Decoder and Execution unit
6.5 Simplified delay profiles
6.6 PVT corners
6.7 Benchmarks
6.8 Execution time reduction percentage using VariPipe
6.9 Energy consumption under the typical PVT corner

List of Figures

2.1 Handshake data encoding schemes
2.2 Handshake styles [1]
2.3 C-element
2.4 Micropipeline [2]
2.5 MOUSETRAP [3]
2.6 Epson’s flexible processor [4]
2.7 Desynchronization method [5]
2.8 Balsa code of a single-place buffer [6]
2.9 Handshake circuit of the single-place buffer [6]
3.1 Sequencer
3.2 Comparison of S-sequencer and T-sequencer behavior
3.3 T-element
3.4 An example of the WAR hazard
3.5 Edge-triggered-based variable
3.6 Five-stage FIFO
3.7 Current synthesis of Dst ← f(Dst, Src) in Balsa
3.8 Revised accumulator circuit
3.9 Timing diagram for the circuit in Figure 3.8
3.10 Inserting a T-element in the Func channel
3.11 The T-isolator
3.12 Timing diagram for the circuit in Figure 3.10
3.13 Inserting a T-element in the Write channel
3.15 Inserting a T-element in the Act channel
3.17 Accumulation loops inside the multiplier
3.18 Design flow
4.1 Clock generation design
4.2 Clock generation loops
4.3 Design Space
5.1 Clock generation circuit
5.2 Clock generation circuit schematic
5.3 Proposed low-power PVT-aware design flow
5.4 Performance of PVT-aware and fixed-clock DLX processors under all PVT corners
5.5 Design Space expansion using PVT-aware design
5.6 Dean’s clocking structure [7]
5.7 Razor flip-flop for a pipeline stage. (a) A shadow latch controlled by a delayed clock augments each flip-flop.
6.1 VariPipe technique
6.2 Clock generation circuit
6.3 Variable delay and toggle
6.4 Reducing the switching power of delay element
6.5 A simplified model of the clock generation circuit
6.6 Proposed VariPipe design flow
6.7 Verilog code of the Execution unit
6.8 Performance of VariPipe and fixed-clock DLX processors under all PVT corners
6.9 Comparison of the clock power spectra
C.1 Chip layout of the PVT-aware processor

Chapter 1

Introduction

Semiconductor technology scaling has introduced new challenges in the design and implementation of synchronous systems, such as coping with process variations. Process variations push designers to use more conservative delay estimations, which result in overly large, leaky and slow circuits. On top of process variations, voltage and temperature variations also push the designer towards even more conservative delay estimations. On the other hand, asynchronous systems exhibit properties that can be of great benefit to today’s implementations, including lower power consumption, better adaptability to process and environmental variations and better performance. Asynchronous circuits are capable of adjusting their speed to the present input data and to operating conditions such as temperature and voltage, while a conventional synchronous design always assumes the longest possible delay.

Despite the advantages of asynchronous design, building useful asynchronous circuits is burdened by the difficulties in their design and implementation. In this dissertation, first, a methodology to increase concurrency in the operation of asynchronous circuits is suggested, which results in faster, smaller and less energy-consuming asynchronous circuits.

However, real-world applications are mainly implemented as synchronous circuits, using well-established synchronous design tools. The main thesis of this work is that asynchronous techniques can be applied to synchronous circuits to considerable advantage. A methodology is proposed to design synchronous circuits with asynchronous advantages. The methodology enhances real-world designs and applications with asynchronous advantages, making possible higher speed, smaller area, lower power consumption and better adaptability to process, environmental and operating conditions.

1.1 Thesis Motivation

As technology scales to smaller feature sizes, process variations make the quality of fabricated chips less predictable. Therefore, more and more conservative delay estimations are used, which result in overly complex and large circuits with excessive power requirements. Variations in other parameters such as voltage and temperature also necessitate more conservative delay estimations. The clock frequency of traditionally designed synchronous systems is determined using the worst-case process-voltage-temperature (PVT) analysis of the most critical path. Most of the time, however, a system is exposed to typical PVT conditions, and in many cases its critical path is not triggered. Under these conditions, the system can run at a higher speed than that determined using worst-case analysis. In summary, the combination of PVT variations and the traditional synchronous design style results in excessively large, power-consuming and slow circuits.
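
To make the cost of worst-case clocking concrete, the following minimal Python sketch compares a fixed clock period, sized for the slowest PVT corner, with the period the same critical path would need at each corner. The corner names and delay values are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
# Hypothetical critical-path delays (ns) at three PVT corners.
# These numbers are illustrative assumptions, not results from the thesis.
CRITICAL_PATH_DELAY_NS = {
    "best": 1.6,
    "typical": 2.0,
    "worst": 3.2,
}


def fixed_clock_period(delays_ns):
    """A conventional fixed clock must cover the slowest corner."""
    return max(delays_ns.values())


def speedup_if_tracking(delays_ns, corner):
    """Speed-up available if the clock period tracked the given corner."""
    return fixed_clock_period(delays_ns) / delays_ns[corner]


if __name__ == "__main__":
    print("fixed clock period (ns):", fixed_clock_period(CRITICAL_PATH_DELAY_NS))
    for corner in CRITICAL_PATH_DELAY_NS:
        ratio = speedup_if_tracking(CRITICAL_PATH_DELAY_NS, corner)
        print(f"{corner}: potential speed-up {ratio:.2f}x")
```

Under these assumed numbers, a chip operating at the typical corner is clocked 1.6 times slower than its critical path requires, purely because the clock was fixed at design time for the worst corner.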

Today’s applications demand more functionality and performance, resulting in a constant increase in power consumption. For example, more and more functions are added to battery-limited handheld devices. How is it possible to meet all these requirements with available technologies?

Asynchronous (clock-less) design exhibits features that can help address today’s design challenges. These features are studied comprehensively in Chapter 2 along with the definition of asynchrony and asynchronous design styles. Potential advantages of asynchronous design over synchronous design include:

• Resilience to process and operating conditions


• Lower power consumption

• Higher performance

The features of asynchronous design are in line with the design challenges mentioned earlier. The objective of this work is to find a way to put asynchronous advantages into real-world use; in other words, to introduce a practical methodology that uses asynchronous advantages to solve the problems introduced earlier, such as coping with process variations, lowering power consumption and improving performance.

Unfortunately, designing asynchronous circuits is difficult and the circuits synthesized by asynchronous tools are large and slow. This is one of the reasons why the asynchronous design style has not been accepted as a mainstream approach by the industry. The challenges in the design and implementation of asynchronous circuits are studied in the next chapter.

1.2 Research Flow

The first step in the work presented in this dissertation was to study available tools and techniques to design asynchronous systems. The main goal was to understand some of the limitations of asynchronous circuits and also to study the resulting products. There are two main groups of asynchronous design tools:

1. Tools such as Balsa [9] and Haste (Tangram) [10–12], which use their own specific hardware description language (HDL) to describe purely asynchronous systems. These tools convert the design into Verilog and then use third-party tools to synthesize and lay out the circuit.

2. Tools that use a conventional HDL (e.g., Verilog or VHDL) to describe a synchronous design and then convert it into an asynchronous circuit (desynchronization). The circuit is then synthesized and laid out using dominant Electronic Design Automation (EDA) tools.


First, Balsa was used to implement various test circuits. This experience led to several observations. The main difference between an asynchronous circuit and a synchronous one is the way timing is realized. Synchronous design uses a single global clock signal. By contrast, asynchronous design uses local handshakes between modules to time their communication and to transfer data. When a module has new data to transfer, it sends a request signal, and when the receiver is ready, the data are transferred. This allows the system to work at the highest possible rate. Different data require different processing times; asynchronous systems are capable of adjusting their speed to the input data. Asynchronous circuits can also adjust their speed to operating conditions such as voltage and temperature. Hence, asynchronous operation is not bound by worst-case assumptions, but is determined by the average-case processing delay of the input data and average-case operating conditions. This feature, although appealing, comes at a cost. Handshake control circuits are distributed throughout an asynchronous circuit to control the timing. This leads to two main problems:

1. Control circuits take area and consume power, especially leakage power.

2. In many cases, the performance is limited by the handshake control circuitry. That is, the delay of the control circuit is so significant that it may become the bottleneck of the system.
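
The trade-off between average-case operation and handshake overhead can be illustrated with a small numerical sketch. It assumes a simple model, not taken from this thesis, in which a synchronous system charges every data item the worst-case delay, while an asynchronous system charges each item its own delay plus a fixed handshake overhead. The delay values and overheads are hypothetical.

```python
def synchronous_time(item_delays):
    """Fixed clock: every item is charged the worst-case delay."""
    return len(item_delays) * max(item_delays)


def asynchronous_time(item_delays, handshake_overhead):
    """Self-timed: each item takes its own delay plus a handshake cost."""
    return sum(delay + handshake_overhead for delay in item_delays)


# Hypothetical per-item processing delays in ns; most items are fast
# and only one exercises the slow (critical) path.
DELAYS_NS = [2, 2, 3, 2, 8, 2, 3, 2]

if __name__ == "__main__":
    print("synchronous (ns):", synchronous_time(DELAYS_NS))
    for overhead in (1, 6):
        total = asynchronous_time(DELAYS_NS, overhead)
        print(f"asynchronous, {overhead} ns handshake overhead (ns):", total)
```

With a 1 ns handshake overhead, the self-timed model finishes in half the time of the fixed-clock model (32 ns against 64 ns). With a 6 ns overhead it takes 72 ns and is slower, which mirrors the second problem in the list: the control circuit, not the datapath, becomes the bottleneck.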

The desynchronization method generates faster and smaller circuits compared to Balsa. However, the handshake control circuits are still large and slow. Balsa and desynchronization are studied in the next chapter.

In this thesis, an optimization is proposed and tested using Balsa to increase the concurrency in asynchronous circuits involving write-after-read (WAR) operations. It is shown that handshakes can be overlapped to achieve a higher performance. However, because the resulting circuit is still purely asynchronous, the control circuit limits the performance significantly.


This experience made it clear that the main problem of asynchronous circuits is that their advantages are burdened by the control circuit implementation. By contrast, synchronous circuits have a minimal timing control mechanism. The main control signal in synchronous circuits is the clock signal. Therefore, the synchronous design approach was adopted as the starting point. Then, asynchronous design techniques were leveraged to build the clock signal for the synchronous system. The combination is a new approach to design and implement synchronous circuits with asynchronous advantages. The asynchronous clock generation circuit introduces a very small area and power consumption overhead. The proposed approach only uses dominant synchronous design synthesis, timing and layout tools. Therefore, it can be used in many applications.

The system that results from the proposed methodology is a synchronous circuit that automatically tunes itself to deliver the best-possible performance under process-voltage-temperature (PVT) variations. The methodology mitigates the conservative assumptions involved in the synchronous design process, resulting in significantly smaller and less leaky circuits.

The results of using the suggested approach on a 32-bit microprocessor are promising, demonstrating that the methodology may be used as an alternative to purely synchronous or purely asynchronous approaches. Designers can use the proposed approach to achieve significant power reductions or performance improvements in many high-speed applications. The designer does not need to worry as much about delays and variations because on-chip circuitry auto-corrects for these variations. This should reduce the design and computer-aided design (CAD) complexity.

The proposed methodology is then expanded to implement variable-speed pipelines. A variable clock period is generated that changes cycle-by-cycle according to the current operations in the pipeline and the current process-voltage-temperature (PVT) conditions. The resulting system is much faster than its fixed-clock counterpart and produces less electromagnetic noise.
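
As a rough illustration of a clock period that changes cycle-by-cycle, the sketch below models a two-stage pipeline in Python. The stage names, operation names and delay values are hypothetical, and the model ignores PVT effects and the clock generation circuit itself; it only shows how the period needed in each cycle depends on the operations currently in the pipeline.

```python
# Hypothetical per-operation stage delays in ns. These are illustrative
# assumptions, not the delay profiles measured in this thesis.
DECODE_DELAY_NS = {"alu": 1.0, "load": 1.2, "mul": 1.0}
EXECUTE_DELAY_NS = {"alu": 1.5, "load": 2.0, "mul": 4.0}


def cycle_periods(program):
    """Return the clock period needed in each cycle of a two-stage pipeline.

    In cycle i, instruction i is decoded while instruction i-1 executes.
    The period must cover the slower of the stages active in that cycle.
    """
    periods = []
    for i in range(len(program) + 1):
        active = []
        if i < len(program):
            active.append(DECODE_DELAY_NS[program[i]])
        if i >= 1:
            active.append(EXECUTE_DELAY_NS[program[i - 1]])
        periods.append(max(active))
    return periods


def fixed_period():
    """A fixed clock must cover the slowest operation in any stage."""
    return max(max(DECODE_DELAY_NS.values()), max(EXECUTE_DELAY_NS.values()))


if __name__ == "__main__":
    program = ["alu", "load", "mul", "alu"]
    periods = cycle_periods(program)
    print("per-cycle periods (ns):", periods)
    print("variable-clock total (ns):", sum(periods))
    print("fixed-clock total (ns):", fixed_period() * len(periods))
```

For this assumed instruction sequence, the variable clock needs 10 ns in total, while a fixed clock sized for the slowest operation needs 20 ns for the same five cycles.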


1.3 Thesis Contributions

The main contributions of this thesis are:

1. A methodology to increase concurrency and enhance the asynchronous synthesis of circuits that involve write-after-read operations.

2. A low-overhead design methodology to implement synchronous circuits with asynchronous advantages. The resulting circuit adjusts its speed to the prevailing PVT conditions. The proposed methodology is accompanied by a complete design flow using standard cells and dominant EDA tools. The methodology reduces the leakage power of high-speed applications in the deep nanometer regime and mitigates process variations.

3. The proposed methodology is expanded to design variable-clock synchronous pipelines that adjust their speed to the current operations in the pipeline as well as current PVT conditions. The resulting system has a higher speed and lower electromagnetic emissions. The overhead of the added clock generation circuit is significantly lower than in previous work.

1.4 Thesis Organization

The remainder of this dissertation is organized as follows: Chapter 2 provides the required background. The first contribution of the thesis is explained in Chapter 3, where several test circuits synthesized using Balsa are examined and optimized by the proposed technique. Chapter 4 introduces the new design methodology to implement synchronous circuits with asynchronous advantages, which is then used in Chapter 5 for leakage reduction. The proposed technique is expanded further in Chapter 6 to design pipelines with a variable clock that adjusts to the current operations in the pipeline and current PVT conditions. Chapter 7 presents concluding remarks and suggestions for future work.

Chapter 2

Background

2.1 Introduction

This chapter presents the background material that forms the basis for the research presented in later chapters. This dissertation demonstrates that many challenges in the design of synchronous systems in today’s nanometer regime are mitigated by employing asynchronous techniques. First, these challenges are discussed, and then the asynchronous design style is reviewed.

Asynchronous design style is introduced followed by handshake styles to implement

asynchronous circuits. Next, C-element, an important building block of asynchronous

circuits is described. Asynchronous pipelines and globally asynchronous locally syn-

chronous (GALS) systems are explained because they are referred to later. Potential

asynchronous design style advantages are discussed, followed by asynchronous applica-

tion examples. These examples demonstrate the usefulness of asynchronous design in

the real world. Challenges of asynchronous design are explained next, followed by earlier

work that attacks some of the difficulties in the design and implementation of asynchronous

circuits.


2.2 Challenges of Today’s Technologies

Parameter variations include process variations due to manufacturing phenomena, volt-

age variations due to both manufacturing and runtime phenomena, and temperature

variations due to varying activity and power consumption levels. Process variations

manifest themselves as die-to-die, within-die and wafer-to-wafer variations. Tempera-

ture and voltage variations are dynamic and change with the circuit operation mode

and the activity. These parameters are collectively referred to as PVT (process-voltage-

temperature). Among all these variations, process variations are becoming increasingly

severe as technology scales down to smaller feature sizes [13].

The variations in process, voltage and temperature make it difficult to achieve the

desired performance and power consumption. First of all, the maximum clockable fre-

quency of the system is determined by the worst-case analysis of the critical path. This

limitation is addressed in Chapter 6, where a methodology to increase performance with-

out changing the design architecture is presented.

Another design challenge in today’s technologies is the power consumption. With

technology scaling, more transistors are packed on the same chip, increasing the power

consumption. The supply voltage is lowered to reduce the power consumption and, accordingly,

the threshold voltage of transistors is lowered as well. A reduction in the threshold

voltage results in a significant increase in the leakage power. The subthreshold current

of CMOS transistors, which is the dominant leakage current component in sub-100nm

technologies, is given by (2.1), where I0 and η are constants, VT is the thermal voltage

and Vth is the threshold voltage. According to this equation, the subthreshold current

increases exponentially with a decrease in the threshold voltage. A comprehensive study

of leakage power physics may be found in [14].

$I_{subth} = I_0 \, e^{-V_{th} / (\eta V_T)}$ (2.1)
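
As a rough numerical illustration of (2.1), the following sketch compares the subthreshold current at two threshold voltages. The values chosen for η (1.5), the temperature (300 K) and the threshold voltages (0.3 V and 0.4 V) are illustrative assumptions and are not taken from this thesis; I0 cancels in the ratio.

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_kelvin):
    """Thermal voltage V_T = kT/q in volts."""
    return BOLTZMANN * temp_kelvin / ELECTRON_CHARGE


def subthreshold_current(i0, vth, eta, temp_kelvin):
    """Equation (2.1): I_subth = I0 * exp(-Vth / (eta * V_T))."""
    return i0 * math.exp(-vth / (eta * thermal_voltage(temp_kelvin)))


def leakage_ratio(vth_low, vth_high, eta=1.5, temp_kelvin=300.0):
    """Ratio of leakage at vth_low to leakage at vth_high (I0 cancels)."""
    low = subthreshold_current(1.0, vth_low, eta, temp_kelvin)
    high = subthreshold_current(1.0, vth_high, eta, temp_kelvin)
    return low / high


if __name__ == "__main__":
    # Assumed example: lowering Vth from 0.4 V to 0.3 V.
    print(f"leakage increases by a factor of {leakage_ratio(0.3, 0.4):.1f}")
```

Under these assumed values, a 100 mV reduction in the threshold voltage raises the subthreshold current by roughly an order of magnitude, which is the exponential dependence stated above.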

Average dynamic power mainly depends on the switching activity of the circuit, the


size of the design, and the supply voltage as shown by (2.2) [15], where C is the load

capacitance of a CMOS gate, f is the clock frequency, α is the gate switching activity

and V is the supply voltage.

$P_{Dynamic} = C V^2 f \alpha$ (2.2)
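
The quadratic dependence on the supply voltage in (2.2) can be checked with a short sketch. The capacitance, frequency, activity and voltage values below are hypothetical and serve only to exercise the formula.

```python
def dynamic_power(c_load, v_supply, freq, activity):
    """Equation (2.2): P = C * V^2 * f * alpha, in watts for SI inputs."""
    return c_load * v_supply ** 2 * freq * activity


if __name__ == "__main__":
    # Assumed example values: 1 pF load, 1 GHz clock, 10% switching activity.
    p_high = dynamic_power(1e-12, 1.2, 1e9, 0.1)
    p_low = dynamic_power(1e-12, 1.0, 1e9, 0.1)
    print(f"power at 1.0 V relative to 1.2 V: {p_low / p_high:.2f}")
```

In this assumed case, lowering the supply from 1.2 V to 1.0 V reduces the dynamic power to about 69% of its original value, while the same relative reduction in f or α would only scale the power linearly.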

Several power reduction techniques are studied in Chapter 5, along with a new

methodology to reduce both leakage and dynamic power.

2.3 Asynchronous Circuits

A circuit is asynchronous when a clock is not used to time the operations of the circuit.

Instead, different modules inside the circuit send completion signals to each other to

indicate the completion of an operation and to request new data. The signals between

different modules are usually referred to as handshake signals. As opposed to a syn-

chronous circuit, the timing between modules is directed by the actual operations of the

system instead of a predetermined timing source.

2.3.1 History of asynchronous systems

The field of asynchronous systems is both old and young. The 1952 ILLIAC at the

University of Illinois had both synchronous and asynchronous parts [16]. The 1960 PDP6

from DEC was asynchronous [17]. Among the pioneers in this field who have developed

much of the theoretical framework are Huffman [18] and Muller [19]. They introduced

Huffman’s circuits and Muller’s circuits, along with basic asynchronous modules. Molnar

at Washington University focused on metastability [20].

Asynchronous logic never disappeared, but once clocked techniques offered

an easy way of hiding hazards and timing complications, clock-less logic was largely forgotten

until the end of the 1970s. The Caltech Conference on VLSI in 1979 contained a complete

session on self-timed circuits. Sutherland’s efforts, especially his award-winning paper

on micropipelines [2], kept the field of asynchronous logic alive.


The first synthesis method for asynchronous logic appeared around the mid-1980s with the

Caltech program-transformation approach [21]. Since that time, there have been other

synthesis systems such as Philips Tangram, and University of Manchester’s Balsa [22,

23]. New hardware description languages have also been developed. Some research on

asynchronous design has been done in the industry, such as the work by Epson. Epson

added asynchronous concepts to Verilog and called it Verilog+. This language has its

own compiler to deal with asynchronous designs [4].

The first single-chip asynchronous processor was designed at Caltech in 1988 [24].

It was followed by Amulet from the University of Manchester in 1993 [25]. There have

been other processors such as TITAC from Tokyo Institute of Technology [26]. Min-

iMIPS was probably the fastest asynchronous processor at that time. It is a 32-bit MIPS

R3000 microprocessor implemented by Caltech in 1997 [27]. Epson introduced its 8-bit

asynchronous microprocessor in 2005 which can be used in wearable devices [4].

In summary, the asynchronous world has been actively creating low-power, self-timed,

low-radiation, high-performance and adjustable systems in the academic and industrial

environments.

2.4 Asynchronous Styles and Building Blocks

The clock is not used to implement timing in asynchronous systems [28]. Local controllers

replace the global clock to time the operation of modules and their communications.

It is essential to understand how these communications work in order to implement real

asynchronous systems. In this section, asynchronous handshake protocols such as the

return-to-zero method will be introduced. Also, different classes of asynchronous circuits

are presented. This is followed by a study of an important asynchronous element called

the C-element.


2.4.1 Asynchronous handshake protocols

In asynchronous systems, there is no clock to specify the rate at which data between

modules should be transferred. Instead, communicating modules use a group of signaling

wires called handshakes. These signals are local to each pair of communicating modules.

Compared to synchronous design, handshakes are activated only when there are new data

to transfer, whereas in synchronous design, clock pulses are generated and distributed

whether there are new data or not.

A handshake channel is formed between two communicating modules by request and

acknowledge signals. The channel may be a push channel, in which the sender sends

data and activates the request signal to indicate that data are ready. The receiver then

accepts the data and activates the acknowledge signal to indicate that data have

been received. By contrast, in a pull channel, the receiver initiates the handshake by

sending a request signal to the sender and the sender replies back by sending data and

activating the acknowledge signal. Handshakes may be implemented by different styles

and the data can be encoded in different ways as explained next.

2.4.1.1 Data encoding

There are two commonly used types of data encoding for handshaking: 1) bundled (single-rail)

data and 2) 1-of-N data encoding, which are shown in Figure 2.1.

In the bundled data protocol, data and handshakes have separate lines. It is also

called the single-rail protocol to distinguish it from the dual-rail protocol discussed

next.

In the 1-of-N encoding protocol, there are N data lines and one acknowledge signal.

These N data lines are used to transfer log2(N) bits of data. For example, 8 lines are

used to transfer 3 bits of data. Similarly, two data lines are used to transfer one bit

of data. The latter case is the dual-rail data protocol. The value of each bit, which is

either 0 or 1, is realized with two separate wires. Therefore, if the wire showing value 0

is activated, the corresponding data value is 0. Similarly, if the other one is activated,

the data value is 1. Otherwise, there are no valid data on the channel.

(a) Bundled (single-rail) data protocol

(b) 1-of-N data encoding

Figure 2.1: Handshake data encoding schemes
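
The 1-of-N encoding described above can be sketched in a few lines. The function names are hypothetical, and the all-zero codeword is treated as the "no valid data" state; dual-rail is simply the case N = 2.

```python
def encode_one_of_n(value, n):
    """Return n wires with exactly one asserted, representing value in [0, n)."""
    if not 0 <= value < n:
        raise ValueError("value does not fit in a 1-of-%d code" % n)
    return [1 if i == value else 0 for i in range(n)]


def decode_one_of_n(wires):
    """Return the encoded value, or None when no wire is asserted (no valid data)."""
    active = [i for i, wire in enumerate(wires) if wire]
    if not active:
        return None
    if len(active) > 1:
        raise ValueError("illegal codeword: more than one wire asserted")
    return active[0]


if __name__ == "__main__":
    print(encode_one_of_n(5, 8))   # 3 bits of data on 8 wires
    print(encode_one_of_n(1, 2))   # dual-rail encoding of the bit value 1
    print(decode_one_of_n([0, 0])) # no valid data on the channel
```

Because validity is carried by the data wires themselves, the receiver needs no separate request wire to detect that a new value has arrived.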

There are also some other data encoding schemes that have been used, such as the

level encoded dual-rail protocol. Similar to the dual-rail encoding, there are two wires

for each bit, namely data and phase. The data value is equal to the value of the bit being

encoded. In cases where two consecutive data tokens are the same, the phase wire changes

value to distinguish between different data.
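
A minimal sketch of the level encoded dual-rail scheme, following only the description above, is given below. The function name and the initial wire state (both wires low) are assumptions made for illustration.

```python
def ledr_encode(bits, data=0, phase=0):
    """Return the (data, phase) wire values after each token in bits.

    Assumes both wires start at the given initial values (0, 0 by default).
    """
    states = []
    for bit in bits:
        if bit == data:
            phase ^= 1   # same value as the previous token: toggle the phase wire
        else:
            data = bit   # value changed: the data wire toggles, phase is unchanged
        states.append((data, phase))
    return states


if __name__ == "__main__":
    for token, wires in zip([0, 0, 1, 1, 0], ledr_encode([0, 0, 1, 1, 0])):
        print(token, wires)
```

In this sketch exactly one wire changes for every token, so the receiver can detect each new token from a single transition, and the data wire always carries the bit value directly.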

2.4.1.2 Handshaking styles

Handshake signals can be four-phase (return-to-zero) or two-phase (non-return-to-zero).

In the four-phase style, request and acknowledge signals are activated and deactivated

by signal levels. For example, a high on request or acknowledge signal shows that it is

activated (up phase). To be activated again, the signal must first return to zero (down

phase). This style is shown in Figure 2.2a.

In the two-phase protocol, request and acknowledge are edge activated as shown in

Figure 2.2b. Therefore, both rising and falling edges of handshake signals show a new

activity on the related signal. There is no need for the handshake signal to return to zero

before the next activation and thus, this style is called non-return-to-zero. The falling


and rising activities on control signals are called events. The two-phase style can only be

used in bundled data communication. In 1-of-N data encoding schemes, there is more

than one request signal and thus, it is not possible to use the two-phase style.

(a) Four-phase

(b) Two-phase

Figure 2.2: Handshake styles [1]
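
The difference between the two handshaking styles can be made concrete by listing the (request, acknowledge) states visited during each transfer on a push channel. The sketch below is an untimed abstraction with hypothetical function names; it only counts signal transitions and does not model delays.

```python
def four_phase_transfer(req, ack):
    """One return-to-zero transfer on a push channel; wires must start low."""
    if (req, ack) != (0, 0):
        raise ValueError("four-phase transfers start with both wires low")
    # request up, acknowledge up, request down, acknowledge down
    return [(1, 0), (1, 1), (0, 1), (0, 0)]


def two_phase_transfer(req, ack):
    """One non-return-to-zero transfer: each wire makes a single transition."""
    if req != ack:
        raise ValueError("two-phase transfers start with req equal to ack")
    req ^= 1                  # request event (rising or falling edge)
    after_request = (req, ack)
    ack ^= 1                  # acknowledge event
    return [after_request, (req, ack)]


def run(transfers, protocol):
    """Return the list of (req, ack) states visited over several transfers."""
    state, trace = (0, 0), []
    for _ in range(transfers):
        steps = protocol(*state)
        trace.extend(steps)
        state = steps[-1]
    return trace


if __name__ == "__main__":
    print(len(run(3, four_phase_transfer)), "transitions for 3 four-phase transfers")
    print(len(run(3, two_phase_transfer)), "transitions for 3 two-phase transfers")
```

Each four-phase transfer costs four signal transitions, whereas each two-phase transfer costs two, which is why the two-phase style is potentially faster when the control circuits are equally simple.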

2.4.1.3 Comparison of asynchronous handshake protocols

Bundled data protocol is more popular than dual-rail, because it consumes less area and

tends to be faster. However, matching the delay between data and control lines needs

more effort. If this is not done properly, control signals will race with data signals and

hazards occur. By contrast, the presence of data in dual-rail protocols implies a request to

the receiver. Consequently, there is no delay between the data and control signals. This

protocol is used to implement delay insensitive circuits.

Two-phase systems should be faster because there are no return-to-zero phases. How-

ever, the circuits required to implement this protocol are more complicated and thus,


slower. Simple and fast circuit implementations have been developed for the dual-rail

protocol, using early evaluation to speed up handshakes. More details can be found

in [29] and [30].

2.4.2 C-element

The Muller C-element is a commonly used asynchronous logic component originally de-

signed by David E. Muller. The main feature of the C-element is hysteresis. It has

memory that keeps its output state until all of its inputs change to the same state. At

that point, the output becomes equal to the inputs. The output remains in this state

until all the inputs switch to the opposite state. There are other varieties of C-elements

such as the asymmetric C-element, where some inputs only affect the operation in one of the

rising or falling transitions. Figure 2.3 shows the symbol, gate-level and transistor-level

design of the C-element.

(a) Symbol (b) Gate level (c) Transistor level

Figure 2.3: C-element

It should be noted that the C-element is one form of rendezvous circuits, which are used

to indicate when the last of two or more signals has arrived at a particular stage [31].

The C-element provides the AND function for events in the two-phase protocol. Another form

of rendezvous circuits is GasP [32].
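
The hysteresis of the symmetric C-element described in this section can be captured by a small behavioural model. The class below is an illustrative sketch with a hypothetical name and interface, not a description of any particular circuit implementation.

```python
class CElement:
    """Behavioural model of a symmetric Muller C-element."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, *inputs):
        """Apply new input values and return the resulting output."""
        if not inputs:
            raise ValueError("a C-element needs at least one input")
        if all(value == inputs[0] for value in inputs):
            self.output = inputs[0]   # all inputs agree: output follows them
        return self.output            # otherwise the previous state is held


if __name__ == "__main__":
    c = CElement()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print((a, b), "->", c.update(a, b))
```

The output rises only after both inputs have risen and falls only after both have fallen, which is the behaviour that lets the element wait for the last of several events.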


2.4.3 Classes of asynchronous circuits

Asynchronous circuits are classified according to the restrictions on their design. The

more stringent restrictions result in fewer probable hazards. On the other hand, the

more relaxed restrictions result in simpler and faster circuits.

One class of asynchronous circuits is Delay Insensitive (DI) circuits [33]. These circuits

operate correctly regardless of the delays on their gates and wires. Here, unbounded

delays are assumed. This is a hard restriction to satisfy and it has been proven that not

many useful DI circuits can be built. Circuits composed of only inverters and C-elements

can be DI.

Quasi-Delay-Insensitive (QDI) circuits are delay insensitive except that isochronic

forks are permitted [34]. It means that a bounded skew is allowed between different

branches of the circuit. This class of circuits has been used more than other classes in

real implementations.

A Speed-Independent (SI) circuit is a circuit that operates correctly regardless of

gate delays. Wire delays are neglected or assumed to be zero [35]. Self-timed circuits are

made up of elements that have their own local timings but at their interface, they are

all delay insensitive. Scalable-Delay-Insensitive (SDI) circuits are very similar to QDI

circuits. However, SDI assumes that the relative delay ratio between two components is

bounded. This helps in designing simpler and faster practical circuits in comparison to

QDI circuits [26].

2.5 Asynchronous Pipelines

Many large sequential circuits are organized as pipelines. Pipelines are used for dividing

total work among different modules, which are kept busy processing different data in the

input queue. There are several pipelines such as micropipelines [2], MOUSETRAP [29],

QDI pipelines [36, 37], asP* [38], GasP [32], wave pipelines [39, 40], and surfing [41].

Micropipelines are studied here because they are the basis for the clock generation circuits


proposed in Chapters 5 and 6. MOUSETRAP is also studied because it is a low-overhead

high-speed asynchronous pipeline design approach, comparable to the one proposed in

Chapter 6.

2.5.1 Micropipelines

Micropipelines were the first asynchronous pipelines, invented by Sutherland in 1989 [2].

A basic Micropipeline is shown in Figure 2.4. It is based on the two-phase handshaking

protocol. The C-elements are used to compare the state of each stage with the next

one. The difference in states means that the current stage is full and the next stage is

empty and therefore data can be transferred. Cd (Capture done) and Pd (Pass done)

are simply delayed versions of the C (Capture) and P (Pass) signals. Storage elements in

micropipelines are event-controlled elements, which are composed of two side-by-side

latches. These latches are activated alternately to generate similar responses to rising

and falling events.

Figure 2.4: Micropipeline [2]

A drawback of micropipelines is that their event-driven storage elements are complex

and slow. However, their control circuit is elegant and is a reference for many other works


in the field [32].

2.5.2 MOUSETRAP

In 2001, Singh and Nowick introduced an asynchronous pipeline control mechanism called

MOUSETRAP, which stands for Minimal-Overhead Ultra-high-SpEed TRansition-signaling

Asynchronous Pipeline [29]. MOUSETRAP is important for two reasons: 1) unlike micropipelines,

it uses conventional latch storage elements, and 2) the control circuit overhead

is minimal and thus, high-performance pipelines may be realized. As shown in Figure 2.5,

the MOUSETRAP design is based on the two-phase bundled-data protocol.


Figure 2.5: MOUSETRAP [3]

Each pipeline stage has a data latch and a latch controller. The data latch is a simple

transparent latch with a default state of being transparent, allowing new data to pass

through quickly. The latch controller enables and disables the data latch. It consists of

only one XNOR gate with two inputs: done from the current stage N and ack from stage

N + 1. Since the pipeline is designed to use a two-phase protocol, the XNOR gate acts

as a phase converter. It converts the transitions on wires done and ack into a level that

controls the corresponding latch.

The operation of MOUSETRAP is as follows. Assume that the latch in Stage N is

transparent. A change in the reqN signal means that a new data item is present at the


stage. The data are latched in the data latch and at the same time, reqN signal is latched

by the control latch. As a result, the doneN signal changes state causing the output of

the XNOR gate to become zero, closing the latch. The doneN signal is delayed by a

fixed delay equal to the worst-case logic processing of stage N . The resulting signal is

reqN+1. If the next stage is ready to accept new data, the new data and control are

latched causing the ackN signal to change state. This changes the output of the XNOR

gate of stage N and opens the data latch of that stage to accept new data.
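
The latch-control behaviour described above can be sketched as an untimed model of a single stage. The class and method names are hypothetical, the matched delay and the data path are omitted, and all control signals are assumed to start at 0.

```python
def xnor(a, b):
    """Two-input XNOR: 1 when the inputs are equal."""
    return int(a == b)


class MousetrapStage:
    """Untimed control model of one stage (data path and delays omitted)."""

    def __init__(self):
        self.done = 0  # output of the control latch of this stage (done_N)
        self.ack = 0   # acknowledge from the next stage (ack_N)

    @property
    def enable(self):
        """Latch enable: 1 means the latch is transparent."""
        return xnor(self.done, self.ack)

    def on_request(self, req):
        """Capture a request transition if the latch is transparent."""
        if self.enable and req != self.done:
            self.done = req   # done_N toggles, which closes the latch
            return True
        return False

    def on_ack(self, ack):
        """Record the acknowledge value received from the next stage."""
        self.ack = ack


if __name__ == "__main__":
    stage = MousetrapStage()
    print("initially transparent:", stage.enable)
    stage.on_request(1)
    print("after request:", stage.enable)
    stage.on_ack(1)
    print("after acknowledge:", stage.enable)
```

In this sketch the stage is transparent whenever done and ack are equal, closes as soon as it captures a request transition, and reopens when the next stage acknowledges, mirroring the sequence of events given in the text.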

2.6 Globally Asynchronous Locally Synchronous Systems

A Globally Asynchronous Locally Synchronous (GALS) system is a mixture of syn-

chronous blocks and asynchronous interfaces. It utilizes advantages of synchrony and

asynchrony. GALS systems were first introduced by D. Chapiro in his PhD thesis in

1984 [42]. GALS is studied here because it can be used in combination with the design

approach presented in Chapters 5 and 6.

Each block in a GALS system is designed as synchronous and its interface is asyn-

chronous. Thus, different blocks interact asynchronously without using a global clock,

reducing clock skew problems. Therefore, GALS architectures are suitable for System

on Chip (SoC) and Network on Chip (NoC). They inherit some benefits of asynchronous

systems such as power and emission reduction and modularity. In short, synchronous

conventionality and asynchronous modularity are combined in GALS systems.

The advantages of GALS systems may not outweigh the additional effort required to

implement such systems. First of all, industries cannot directly utilize GALS because

most engineers are not familiar with asynchronous concepts. Also, the design process to

implement a GALS system is difficult. There is no known algorithm that can be used to

partition a system into different clock regions and automatic place and route tools are

not available for GALS [43].


2.7 Potential Asynchronous Design Advantages

In this section the advantages of asynchronous design are summarized and discussed.

There are many challenges in implementing applications as asynchronous systems. There-

fore, it is necessary to understand why asynchronous design should be considered. Asyn-

chronous circuits can be better than synchronous counterparts in many ways, which

include lower power, lower emission level, modularity, and better adaptability to process

variations. Real-world examples that demonstrate these advantages are presented in the

next section.

2.7.1 Avoiding clock skew

System on Chip (SoC) architectures are becoming increasingly complicated. Feature sizes

are decreasing and more functionality is added to the chip. At the same time, higher

frequencies are used. With more clocked modules at high frequencies, the circuits are

more sensitive to clock skew. The problem of clock skew becomes more serious as the

technology advances and the frequency increases.

Many efforts have been made to solve this problem. One of the proposed solutions

is to use clock-less systems. Asynchronous systems do not have a global clock and thus,

they do not suffer from the clock skew.

A branch of asynchronous systems called Globally Asynchronous Locally Synchronous

(GALS) systems, introduced previously, may be used to avoid using a global clock. In-

stead, local clocks are used in the submodules of the system, which are interconnected

using asynchronous interfaces. Since clocks are confined to smaller areas, clock skew is

reduced. However, as mentioned before, compiling digital designs into an architecture

which is neither purely synchronous nor purely asynchronous is hard.


2.7.2 Lower electromagnetic noise

The noise generated by synchronous systems is highly concentrated in one frequency —

the global clock frequency. This can cause interference within the system, as well as with

adjacent systems that operate at a similar frequency.

With no global clock, each part of an asynchronous system operates at its own speed.

Consequently, the system operates in a much wider range of frequencies compared to

a synchronous system. As a result, the electromagnetic noise of an asynchronous sys-

tem is not concentrated in a single frequency. The lower electromagnetic emission of

asynchronous systems has been used in several applications, such as a pager [44].

2.7.3 Lower power consumption

Power comparison of asynchronous systems and their synchronous counterparts depends

on the application and the technology. Three main factors should be considered when

comparisons are made:

• Clock tree versus handshake controls: Clock tree consumes a large portion of the

power because it switches at the fastest rate in the circuit. Local handshakes replac-

ing the clock tree in asynchronous circuits may consume significant power. However,

when the circuit is idle, the power consumption of the handshake control is mini-

mized automatically.

• Area and Leakage power: Generally, asynchronous circuits tend to be larger than

their synchronous rivals. The number of transistors to implement a specific cir-

cuit as asynchronous is often higher than in its synchronous counterpart due to

the handshake controllers. Larger area and more transistors increase the power

consumption, especially leakage power in deep submicron technologies.

The choice between asynchronous and synchronous implementation depends on the

application. In systems that quickly switch between idle and active modes such as burst


receivers and RFID, asynchronous design might be a better choice [45]. There are tech-

niques that reduce the power consumption of synchronous circuits during idle periods

such as clock gating and turning off the clock oscillator. However, these techniques

are costly; for instance, clock gating reduces the maximum operating clock frequency.

Switching an oscillator on and off is also a very costly solution in terms of delay and

energy consumption because it takes time and energy for the system to return to its

normal running mode.

2.7.4 Resilience to process and environmental variations

An important property of asynchronous systems is their adaptability. Since asynchronous

systems can be designed as self-timed circuits, they can adapt to variations in voltage,

temperature, data rate and even process. Several researchers have studied this property

and have successfully shown that asynchronous circuits can adjust to wide variations in

operating parameters [4, 5, 46].

2.7.5 Modularity

Another feature of asynchronous systems is modularity. In synchronous implementations,

interface design is always time consuming. It is a hard part of the design process, partic-

ularly in big projects where different submodules may require different clock frequencies.

Asynchronous systems are clock-less and thus, connecting asynchronous modules is easier

and should require less effort.

2.7.6 Higher performance

Finally, an important area of comparison of asynchronous and synchronous systems is

performance. Several research papers have been published on this matter [47, 48]. It

is not easy to answer the question “Are asynchronous systems faster than synchronous

ones?”

The performance of a synchronous system is limited by its slowest module. In the


design of synchronous systems, the worst-case situation must be taken into account. The

critical path under worst-case process-voltage-temperature (PVT) conditions defines the

clock period. By contrast, the performance of an asynchronous system is based on average

delay. The delay of the longest circuit path currently in operation determines the speed.

Therefore, the performance of an asynchronous system is data dependent. In addition to

data, the speed of the system adjusts to current PVT conditions and not the worst-case

conditions. As a result, if the worst-case delay is longer than the average-case delay and

if the worst-case delay happens rarely, then, an asynchronous implementation is likely to

outperform its synchronous counterpart.
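
The trade-off between worst-case and average-case timing can be illustrated with a small worked example. The operation names, delays, operation mix and handshake overhead below are hypothetical values chosen only to show the arithmetic.

```python
def synchronous_time(op_delays, op_counts):
    """Fixed clock: every operation is charged the worst-case delay."""
    return max(op_delays.values()) * sum(op_counts.values())


def self_timed_time(op_delays, op_counts, handshake_overhead=0.0):
    """Self-timed: each operation takes its own delay plus a handshake overhead."""
    return sum((op_delays[op] + handshake_overhead) * count
               for op, count in op_counts.items())


if __name__ == "__main__":
    # Assumed example: delays in ns, with the slow operation occurring rarely.
    delays = {"add": 2.0, "mul": 5.0}
    counts = {"add": 90, "mul": 10}
    print("synchronous:", synchronous_time(delays, counts))
    for overhead in (0.0, 1.0, 3.0):
        print("self-timed, overhead", overhead, ":",
              self_timed_time(delays, counts, overhead))
```

With these assumed numbers the self-timed version is faster as long as the per-operation handshake overhead is small, but with an overhead of 3 ns it becomes slower than the fixed clock, which illustrates why the control overhead matters as much as the average-case advantage.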

On the other hand, asynchronous circuits are slowed down by handshake delays. Handshakes

are mostly based on return-to-zero protocols, which are inherently slow [2]. Non-return-to-zero

protocols are potentially faster. However, the circuits used to implement them are

bigger, and consequently, their performance is not better than that of circuits with return-to-zero

protocols.

2.8 Asynchronous Design Applications

Several application examples of asynchronous design are presented in this section, taken

from both industrial and academic research. Applications show where asynchronous

design is strong and useful.

The implementation of a Radio Frequency IDentification (RFID) system demonstrates

that the adaptability of asynchronous design may be used in systems that must work

in different environments [45]. In this work, researchers implemented an active reader

and a passive tag for a contactless RFID system. Energy and data can be transmitted

between the reader and the tag over a range of 6 centimeters. When the distance of the

tag changes, the asynchronous receiver can adapt to the new signal power. Therefore,

different power configurations can be determined for the required data rate over the

channel. The implemented RFID is a high-speed low-power module that consumes power


only when it is active and can reach a data rate of 1.02 Mbps.

Another work is an ultra-low-power asynchronous processor [5]. This processor is

based on Atmel’s Advanced Virtual RISC (AVR) instruction architecture. The implemented

processor can work in a wide range of voltages, including values very close to

the threshold voltage of a transistor. The adaptability of the processor to this wide range

of voltages allows it to consume different levels of power according to the required per-

formance. The processor is low-power and low-emission making it suitable for Wireless

Sensor Networks (WSN).

In 2005, Epson introduced its new flexible 8-bit asynchronous microprocessor [4],

which is shown in Figure 2.6. It is built on the Thin Film Transistor (TFT)

technology used for wearable devices. Due to the wide crystal variations in this tech-

nology, satisfying the restrictive constraints of synchronous design, such as a fixed clock

frequency, is not possible.

Figure 2.6: Epson’s flexible processor [4]

An ARM-compatible microprocessor for smartcard chips has been built using the dual-rail

protocol. Dual-rail has better resistance to attacks based on power analysis and electromagnetic

analysis, compared to the single-rail protocol [49].

A study of a Viterbi decoder’s data path shows that the average-case delay of its data

path is only 84% of the worst-case delay. Since asynchronous circuits are based on the

average delay, an asynchronous design is used to achieve better performance over the


synchronous implementation [48].

A group of researchers at Stanford University designed a locally clocked globally

asynchronous cache controller which is twice as fast as its synchronous counterparts.

They used the same feature of average-case operation of asynchronous systems to achieve

this improvement [47].

Finally, the adjustability of asynchronous circuits to voltage and coupling variations

has been used in implementing a robust and high-throughput inter-chip communication

channel in [50].

Despite the advantages of asynchronous systems, building these systems is challenging.

These challenges are described in the next section.

2.9 Difficulties in Asynchronous Design

In this section, a general review of some difficulties in the design of asynchronous circuits

and systems is given, which can be categorized into four groups:

1. The lack of designers’ knowledge and experience in asynchronous design.

2. The lack of strong compilers and Electronic Design Automation (EDA) tools.

3. The lack of pre-designed asynchronous architectures (e.g., FPGAs).

4. Difficulties that are inherent to asynchrony.

First of all, asynchronous design methodologies are different from synchronous design

methodologies. Synchronous systems have been in the market for decades. Therefore,

many engineers are familiar with synchronous design flows and synchronous design tech-

niques have been well developed. Engineers and companies are not willing to use asyn-

chronous design unless there is a proof showing that asynchronous circuits can be useful

in solving real-world problems.

In comparison to synchronous tools, asynchronous design tools are not well devel-

oped. There are a few languages and packages such as Philips Haste and Epson Ver-


ilog+, but these languages are not publicly available yet. There are some academic tools

such as Balsa. The problem is that there is no consensus on a design methodology

for asynchronous implementation. Each company or university uses its own language,

which makes it more difficult for synchronous designers to understand where to enter the

asynchronous world. Many synchronous designers know Verilog/VHDL. However, these

languages in their present form are not suitable for describing asynchronous systems.

The lack of pre-designed asynchronous architectures is another obstacle for utilizing

asynchronous design. Many digital designers are not necessarily experienced in ASIC

design. They describe their desired hardware using an HDL and then simply use com-

mercial tools to target programmable logic devices such as FPGAs. FPGAs are useful

for prototyping and mass production of many applications.

Unfortunately, a similar design flow does not exist for asynchronous design. There is

no commercial asynchronous FPGA on the market, although some experimental asyn-

chronous architectures have been built [51]. FPGAs have been used for proving an idea or

demonstrating the performance of asynchronous systems [52]. Although this practice can

be useful for developing ideas, it is not suitable for industrial use. Present FPGA com-

pilers do not understand asynchronous designs and lead to very poor place and route or

even an unsuccessful fitting. Developing FPGA architectures that support asynchronous

design and providing EDA tools for them are necessary stepping stones to broaden the

use of asynchronous systems.

There are also many difficulties that are inherent to asynchrony, such as hazards and

deadlocks. Hazards can cause asynchronous systems to malfunction. In any computation,

there might be glitches due to different gate and wire delays. In synchronous systems,

these glitches are rarely important because flip-flops only accept inputs at clock edges.

In asynchronous systems, glitches are indistinguishable from real data and hence may

lead to errors. A discussion of hazard categorization and avoidance techniques can be

found in [1].


Deadlocks can occur in systems where two or more competing actions are waiting

for one another to finish, and thus neither ever does. While there is no general solution

for deadlock prevention, knowledge of the system operation along with the designer’s

experience can help in reducing the possibility of deadlocks.

2.10 Attacking Asynchronous Challenges

Despite the difficulties in the design of asynchronous systems, researchers have proposed

several methodologies to simplify the design process, which are reviewed here. An asyn-

chronous design methodology or design flow determines the way a designer describes the

behavior of an asynchronous system. This description is then synthesized for a specific

technology. Asynchronous design methodologies can be grouped in three main categories:

1. Design methodologies that start with a synchronous design and convert it to an

asynchronous system (desynchronization).

2. Design methodologies that use conventional hardware description languages or an

amendment of these languages.

3. Design methodologies that use Communicating Sequential Processes languages.

In this section, these design methodologies are introduced briefly and their advantages

and disadvantages are discussed.

2.10.1 Desynchronization

In this method, the target system is first implemented as a synchronous system, and then

converted into an asynchronous one. This process includes three main steps [5, 46, 53]:

1. Conversion of flip-flops to master-slave latches.

2. Generation of matched delays for the combinational logic.

3. Implementation of latch controllers to control handshakes.


Figure 2.7: Desynchronization method [5]: (a) synchronous circuit; (b) desynchronized circuit

An example of a desynchronized circuit is shown in Figure 2.7.

The main advantage of this methodology is simplicity. Designers can use this method

to generate an asynchronous circuit provided that the desynchronization tool is available.

As reported in [5], the design flow is fast and many of the benefits of asynchronous design are inherited.

In [46], a desynchronized DLX processor is compared against its synchronous counterpart.

The desynchronized processor is faster than its synchronous counterpart under typical

process-voltage-temperature (PVT) conditions.

The main disadvantage of the desynchronization method is the overhead of the communication handshakes required between master and slave latches. As reported in [46], the performance overhead due to desynchronization is 20% in a DLX microprocessor; that is, the desynchronized processor is 20% slower than its synchronous counterpart under worst-case PVT conditions. This timing overhead stems from the delay between the fall of the slave enable signal and the rise of the master enable signal. Also, the desynchronized processor is 13.44% larger than the synchronous design and consumes more power.

2.10.2 Methods using conventional HDLs

Digital designers use Verilog and VHDL to describe their designs. The advantage of

design methodologies that use these languages is that designers are familiar with the

language.

A drawback of using these languages is that they are not originally designed to describe

asynchrony. Therefore, many basic concepts of asynchronous systems such as handshakes

are not supported.

In an experiment by the author, VHDL packages to model and implement asyn-

chronous handshakes were written. The difficulties encountered include:

• Using proper asynchronous HDL packages does not guarantee the delay insensitivity

of the design.

• Using HDLs without enough experience of asynchronous concepts can easily result

in hazards and deadlocks.

• A synthesizable handshaking package was used with the Altera Quartus II FPGA compiler. Although the package could be synthesized, post-place-and-route simulations revealed that Quartus II does not understand asynchronous concepts and that a significant amount of manual work is required to fix the handshakes.

Researchers have acknowledged these difficulties and tried to extend Verilog and

VHDL language structures to include asynchronous concepts. Besides language struc-

tures, a compiler is needed to compile the HDL amendment and synthesize the design into

asynchronous circuits. A group at Epson introduced Verilog+ to support asynchronous

data types and handshakes [4]. The Verilog+ code is translated back to Verilog, then

simulated using a conventional HDL simulator. The compiler to convert Verilog+ code

into Verilog is not complete yet and many steps of the synthesis are performed manually.

The design and synthesis tools are not publicly available.


Other researchers have tried to add packages to conventional HDLs in order to support the specific logic required to describe asynchronous circuits. For example, Null Convention Logic (NCL) has been used for asynchronous design. The motivations behind using NCL are:

1. It can be added as a package to VHDL.

2. It meets the quasi delay insensitive (QDI) requirements of asynchronous design.

NCL is a logic group with a set phase and a reset phase. In the set phase, the data lines change from a space holder called NULL to a proper codeword. In the reset phase, the data lines return to NULL. NCL has been added as a package to VHDL, and conventional synchronous CAD tools have been used to synthesize asynchronous designs [54–56].

2.10.3 Methods using communicating sequential processes languages

Communicating Sequential Processes (CSP) languages are used for programming con-

current systems and designing asynchronous systems. Examples are Tangram and Balsa,

which were introduced by Philips and the University of Manchester respectively. They

support asynchronous concepts and thus, low-level coding is not required to implement

handshakes. Tangram and Balsa are good examples of high-level design languages that

support asynchrony. Balsa is publicly available and its latest version at the time of

writing was released in the summer of 2006 [57]. Some further details about Balsa are

provided here because it is used in the next chapter.

Balsa has a user-friendly interface and its syntax is well suited for asynchronous design. For example, a handshake channel is implemented using a single line of code. Special symbols are used to support the concepts of synchronization, sequentiality, and concurrency. As an example, a Balsa description of a single-place buffer is presented in Figure 2.8. This Balsa description builds an 8-bit single-place buffer. The circuit requests a byte from the environment on its input channel i. When new data are available, the circuit transfers the data to register x. Then, the circuit signals to the environment on its output channel o that new data are available and the environment reads the data when it chooses. This operation is repeated because the handshakes are enclosed in a loop.

Figure 2.8: Balsa code of a single-place buffer [6]
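The behavior of the single-place buffer can be mimicked in a few lines of Python. This is only a behavioral sketch: it is not Balsa code and not the listing of Figure 2.8. The function name, the trace labels and the use of plain lists in place of handshake channels are assumptions made for illustration.

```python
def single_place_buffer(input_bytes):
    """Model the buffer loop: take one byte from channel i into x, then offer x on channel o."""
    outputs = []
    trace = []
    for data in input_bytes:        # the enclosing loop repeats the two handshakes
        x = data & 0xFF             # handshake on i: store the byte in the 8-bit register x
        trace.append(("input", x))
        outputs.append(x)           # handshake on o: the environment reads x
        trace.append(("output", x))
    return outputs, trace
```

The trace strictly alternates between input and output events, which reflects the single-place nature of the buffer: a new byte is not accepted until the previous one has been handed over.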

The Balsa description of the buffer is then converted to the handshake circuit shown in Figure 2.9. Balsa’s synthesis is syntax directed. That is, there exists a one-to-one circuit representation for each line of the Balsa code.

Figure 2.9: Handshake circuit of the single-place buffer [6]

The filled circles in the figure are active ports that initiate a handshake. Hollow circles are passive ports that wait for the handshake to start. A handshake channel exists between two connected ports. Activation is similar to a reset signal that, when released, initiates the operation of the circuit. After activation, the Loop initiates a handshake with the Sequencer. The Sequencer first issues a handshake to the left-hand Fetch component (→), causing data to be moved to the Variable element (x). The Sequencer then handshakes with the right-hand Fetch component, causing data to be read from the Variable element. When these operations are complete, the Sequencer completes its handshake with the Loop element, which starts the cycle again.

The control elements in the circuit, such as the sequencer, have some delay that limits the performance of the system. To improve the performance, several optimizations and techniques have been proposed. For instance, the technique presented in [58] overlaps the delay of the handshakes in the channels of a sequencer and thus improves concurrency and performance.

The handshake circuit is then mapped into the target technology using predefined

implementations of the handshake modules. The user may choose between four-phase

single-rail and dual-rail implementations. The synthesis backend is compatible with both

Xilinx and Cadence physical design tools. The synthesized netlist is a Verilog code in a

target technology. The developers of Balsa claim that adding a new target technology

to Balsa is easy. These features make Balsa an appropriate design tool for asynchronous

research.

Balsa has an integrated development environment (IDE) that includes an editor and a

simulation environment. It includes a graphical simulation environment and an automatic

testbench generator. Balsa has built-in utilities that help in understanding the operation

of the system under design. For example, it shows the handshake circuit after coding the

circuit operation in Balsa (similar to Figure 2.8 and Figure 2.9).

Using Balsa’s design environment helps in understanding the difficulties involved in asynchronous design, such as deadlocks and hazards. Balsa helps in avoiding deadlocks by recognizing obvious deadlocks and warning the designer about their existence. However, it remains the designer’s responsibility to avoid deadlocks. Further explanation of Balsa and the suggested design flow employing it may be found in Chapter 3.

2.11 Concluding Remarks

In this chapter, several challenges in the design and implementation of synchronous sys-

tems are studied including PVT variations and performance and power requirements.

Asynchronous design is briefly reviewed. Several application examples are provided

demonstrating the potential advantages of asynchronous design over synchronous de-

sign. These advantages include better adaptability to PVT variations, lower power con-

sumption and better performance, which can be used to diminish some of the challenges

in today’s implementations. However, the features of asynchronous design have been

hindered by the design and implementation difficulties and by the delay and power con-

sumption of handshakes. In the next chapters, several design techniques to reduce these

difficulties are suggested.

Chapter 3

Application of Concurrency in

Asynchronous Design

3.1 Introduction

Many approaches have been proposed over the years for the synthesis of asynchronous

circuits. Syntax-directed compilers such as Balsa [9], Tangram [11,12] and other synthe-

sis tools for high-level languages such as SpecC [59] are able to synthesize a high-level

description into a gate-level netlist, without a need for the designer to become involved

in implementation details. The synthesis process guarantees correct sequencing of hand-

shake operations by using standardized control elements known as sequencers. The syn-

thesis process also assists the designer in avoiding timing hazards. This approach has

been used successfully in the implementation of large asynchronous systems, such as

ARM-compatible asynchronous processors [49,60].

This chapter examines the potential benefits and trade-offs involved in using edge-

triggered elements where write-after-read (WAR) hazards exist [61, 62]. A WAR hazard

exists in any digital system where a variable is first read and then new data are written

into it, as in first-in-first-out (FIFO) structures and in the implementation of accumu-

lation statements of the form Dst ← f(Dst, Src). If new data are written into the


destination variable before the previous data have been consumed, a WAR error oc-

curs. To avoid WAR errors in asynchronous circuits, variables are implemented as simple

latches and a non-concurrent sequencing mechanism is used to move data. In the case of

accumulation statements, this necessitates the use of master-slave configurations.

Master-slave configurations have also been used in asynchronous pipelines to improve

performance. In [63,64], it is demonstrated that to improve the performance of a pipeline,

the rate of data generation and data consumption should be balanced, which may be

achieved by adding extra buffers to the pipeline.

In this chapter, the focus is on introducing concurrency in the operation of an asyn-

chronous circuit. It is shown that in many cases, using an edge-triggered configuration

instead of a master-slave configuration may lead to increased concurrency in the circuit’s

operation. The proposed ideas are applied to single-rail circuits that use the four-phase

handshake protocol. Similar techniques can be used for dual-rail asynchronous circuits.

In previous studies [58, 65, 66], researchers have shown how to accelerate the down

phase of handshakes by using T-elements in the implementation of sequencers, but em-

phasized that T-elements cannot be used in the presence of WAR hazards [58]. The

WAR hazard should be resolved before using T-elements. The early close scheme and

the interlock scheme are two approaches that have been suggested by Plana and Nowick

to resolve the WAR hazard [67]. They are described in Section 3.3. The approach pre-

sented in this chapter is to use edge-triggering. It is shown that T-elements can be used

in the implementation of sequencers, even in the presence of WAR hazards, by exploiting

edge-triggered storage elements. T-elements serve to break the sequence of handshakes

between different components such as functions, variables and multiplexers, resulting in

a shorter down phase. Timing analysis is presented to quantify the achievable gain in

speed and explain the simulation results. Guidelines are also given to assist designers in

applying the proposed approach to other handshake circuits.

The degree of concurrency enabled by different types of sequencing elements is re-

viewed briefly below as it is a key factor in determining the speed of the synthesized


circuit. Subsequent sections present handshake structures that use edge-triggering to im-

plement WAR operations and describe the trade-offs involved. Special attention is paid

to the synthesis of accumulation statements because of their importance in all forms of

digital processing.

The synthesis techniques proposed are amenable to syntax-directed compilation. The

experimental results given have been derived by simulating circuits synthesized by Balsa

and implemented by Synopsys Design Compiler. Synopsys Power Compiler was used for

simulation-based power analysis.

3.2 Background

3.2.1 Handshake circuits

Handshake circuits are represented by diagrams composed of handshake components such

as variables, functions and controls, with active and passive ports connected via push

or pull channels. Active ports initiate a handshake by sending a request signal, and

passive ports acknowledge the handshake. In a push channel, the sender sends data and

communicates with the receiver by generating a request signal, whereas in a pull channel,

the receiver initiates the handshake by requesting data. Not all channels carry data. Some

channels are purely for control and are referred to as sync or activation channels. Further

details about handshake circuits and conventions can be found in [9, 12,58].

3.2.2 Sequencers

Sequencers are components that control the timing of events in an asynchronous system. Upon receiving an activation signal, Act, the two-step sequencer in Figure 3.1 triggers a handshake on channel C1 followed by a handshake on channel C2. Two types of sequencers have been proposed in the literature, which we refer to as S and T sequencers. Their behavior is illustrated in Figure 3.2. An S-sequencer waits for the handshake on channel C1 to be completed before starting the handshake on channel C2 [12]. A T-sequencer starts the handshake on channel C2 once the up phase of the handshake on channel C1 is completed [58]. Since the activities on C2 proceed in parallel with the down phase on C1, a T-sequencer introduces concurrency in the circuit’s operation.

Figure 3.1: Sequencer

Figure 3.2: Comparison of S-sequencer and T-sequencer behavior: (a) S-sequencer behavior; (b) T-sequencer behavior
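The difference between the two sequencers can be made concrete with an idealized timing model. The sketch below is an assumption-laden simplification rather than a description of the actual S and T implementations in [12, 58]: it ignores the internal gate delays of the sequencer itself and assumes that the sequence is finished once both handshakes have completed. The function and parameter names are hypothetical.

```python
def s_sequencer_time(up1, down1, up2, down2):
    """S-sequencer: the whole handshake on C1 finishes before C2 starts."""
    return up1 + down1 + up2 + down2


def t_sequencer_time(up1, down1, up2, down2):
    """T-sequencer: C2 starts after the up phase of C1, so the down phase of C1
    overlaps with the complete handshake on C2 (idealized, no sequencer delay)."""
    return up1 + max(down1, up2 + down2)
```

With hypothetical phase delays of 3, 2, 4 and 1 time units, the S-sequencer model takes 10 units while the T-sequencer model takes 8, because the 2-unit down phase on C1 is hidden behind the activity on C2.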

The core components of the S and T-sequencers are the S and T-elements, respectively. The T-element is shown in Figure 3.3. Following a request on Channel A, it initiates the up phase of the handshake on Channel B by issuing B_Req. When it receives B_Ack, it removes B_Req and at the same time issues A_Ack, thus allowing the two handshake operations to proceed to completion concurrently. Implementation details of the S and T-elements and the corresponding sequencers are available in [12, 58].

(a) Symbol (b) Structure level (c) Behavior

Figure 3.3: T-element

3.3 WAR Hazards

Figure 3.4 illustrates a simple FIFO circuit. Each arrival of the activation signal, Act, transfers data from Var1 to Var2 and from the input to Var1. The circuit exhibits a WAR hazard. The first command, C1, reads the content of variable Var1 and saves it in Var2, and the second command, C2, writes new data into Var1. If the data input to Var2 is still enabled, the new data would be incorrectly written into Var2 as well. Existing asynchronous synthesis methods realize variables with transparent latches [5, 12, 22] and use an S-sequencer to guard against the WAR hazard [58]. An S-sequencer ensures that the down phase on C1 has been completed, thus also ensuring that the input to Var2 has been disabled, before starting the C2 command. A T-sequencer cannot be used in this case, because it would issue command C2 before C1 is completed, thus leaving the WAR hazard unresolved. Note that the transferrer unit (→) consists solely of port-to-port wire connections and has no timing implications.

Figure 3.4: An example of the WAR hazard

The sequential execution of handshakes imposed by the S-sequencer places an upper

limit on the speed of operation of a circuit. Provided some means for handling the WAR

hazard is available, the speed of operation can be increased by introducing concurrency.

In what follows, the early close scheme and the interlock scheme by Plana and Nowick [67]

to resolve the WAR hazard are reviewed. Then, the use of edge-triggering to avoid WAR

is studied.

The early close scheme adds an extra mechanism to the control circuit, which closes the destination latch Var2 before writing new data into Var1. With some reasonable timing assumptions, this method makes it possible to use a T-sequencer. An 85% throughput improvement has been achieved in a dual-rail implementation of an 8-stage shift register [67], compared to the original design with S-sequencers. The power-delay product was roughly the same.

In the interlock scheme, writing into the source latch Var1 is stalled until Var2 is opaque again. This method guarantees correct operation, but the speed improvement obtained by employing T-sequencers is limited.

The approach proposed in this thesis to avoid the WAR hazard is to replace trans-

parent latches with edge-triggered flip-flops. An edge-triggered flip-flop occupies more

area and consumes more power than a latch. Thus, the choice between edge-triggered

flip-flops and normal latches amounts to a trade-off among the three parameters of speed,

area and power. In what follows we show that there are many situations in which the

trade-offs involved are strongly in favor of using edge-triggered circuits.

If the variables in the structure of Figure 3.4 are implemented as edge-triggered flip-flops, the rising edge of the up phase of command C1 records the data into Var2 and then immediately disables its inputs, making the down phase redundant. Therefore, the down phase of C1 can be executed concurrently with the up phase of C2 without giving rise to a WAR hazard. This means that a T-sequencer may be used instead of an S-sequencer.
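The WAR hazard and its removal by edge-triggering can be reproduced with a toy model of the two-variable structure of Figure 3.4. The sketch below is a deliberately simplified assumption: handshakes are reduced to an ordering of events, the `overlap` flag stands in for the choice between an S-sequencer (no overlap) and a T-sequencer (overlap), and all names are hypothetical.

```python
def shift_with_latches(var1, var2, new_data, overlap):
    """One activation with transparent latches.

    overlap=False models an S-sequencer (C1 fully completes before C2);
    overlap=True models a T-sequencer (C2 starts while Var2 is still open).
    """
    # C1 up phase: the latch of Var2 opens and copies Var1.
    var2_open = True
    var2 = var1
    if not overlap:
        var2_open = False          # C1 down phase finishes first, closing Var2
    # C2: new data are written into Var1.
    var1 = new_data
    if var2_open:
        var2 = var1                # WAR error: new data leak through the open latch
    return var1, var2


def shift_with_flipflops(var1, var2, new_data):
    """One activation with edge-triggered variables; overlap is harmless."""
    var2 = var1                    # captured on the rising edge of the C1 request
    var1 = new_data                # later changes of Var1 no longer reach Var2
    return var1, var2
```

Starting from Var1 = 1 and Var2 = 0 with new data 2, the latch model yields (2, 1) without overlap but the erroneous (2, 2) with overlap, whereas the flip-flop model yields (2, 1) regardless of how the two commands are overlapped.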

Figure 3.5 shows a variable that employs edge-triggering. The write request signal serves as the clock. The buffer element introduces some delay before generating the write acknowledge signal to allow sufficient time for the data to be stored in the flip-flops. The read request generates a read acknowledge immediately, because the output of the flip-flop is always enabled. As with synchronous circuits, the designer must ensure that the setup and hold times of the flip-flops are met.

Figure 3.5: Edge-triggered-based variable (D flip-flop with ports Write_Data, Write_Req, Write_Ack, Read_Data, Read_Req and Read_Ack)

To illustrate the trade-offs involved in the proposed approach, a 16-bit, 5-stage FIFO circuit was synthesized as shown in Figure 3.6. It uses a bundled-data protocol and the edge-triggered circuit of Figure 3.5 to implement the variables. The four-step sequencer is implemented using three 2-step T-sequencers, as shown. Post-synthesis simulation showed that the resulting concurrency led to a circuit 2.3 times faster than the circuit employing an S-sequencer and latches. This substantial increase is due in part to the concurrency introduced by the T-sequencer. Also, the T-sequencer is a faster and smaller circuit that presents lower electrical loads to its environment compared to the S-sequencer. The increase in speed in this case comes at the expense of an increase in area and power consumption. The circuit area increased by 52% and the power-delay product by 3%. Implementation details are described in Section 3.7. Compared to the results from the early close scheme, edge-triggering provides a larger speed improvement. However, the early close scheme uses latches, which are smaller than edge-triggered flip-flops.

Figure 3.6: Five-stage FIFO (variables Var 1 to Var 5 sequenced by three T-sequencers; the rest of the circuit provides input to the first stage and takes the output of the last stage)

This simple example clearly shows that the combination of edge-triggered flip-flops and T-sequencers has the potential for significant increases in speed, but careful attention needs to be paid to circuit area and power consumption where these are important considerations. In what follows, we show that gains in both speed and power are possible in many commonly encountered circuit configurations.

3.4 Using Edge-Triggering in Accumulator Circuits

The case of an accumulation statement, Dst ← f(Dst, Src), offers opportunities for

performance enhancement, where improvements in both speed and area are possible.

An accumulator loop is often synthesized using transparent latches, as shown in Figure 3.7. This is the implementation generated automatically by the Balsa compiler [6]. To avoid the WAR hazard, the Auxiliary-Dst and Dst variables constitute a master-slave structure, and an S-sequencer must be used to ensure correct data transfers. Following the arrival of an activation signal, command C1 causes the result of function f to be stored in the auxiliary variable. When this step is completed and the input of the auxiliary variable has been disabled, command C2 transfers the result to the destination variable Dst.

Figure 3.7: Current synthesis of Dst ← f(Dst, Src) in Balsa

Several variations on this basic configuration are discussed below using the four-phase


single-rail protocol. To assess the impact of the changes, two test circuits are used. The

first is a minimal system containing a single 16-bit accumulator. The system accepts

successive input data elements, activates the accumulation loop, then sends out the

accumulation results. The second system is an 8-bit by 8-bit radix-4 Booth multiplier.

In the ensuing discussion, the implementations based on the configuration of Figure 3.7

are used as a reference for comparison. Changes in speed, area and power-delay product

are expressed as a percentage of the corresponding parameters for the reference circuit.

Complete simulation details are presented in Section 3.7.

In some cases, it is possible to design handshake circuits that take advantage of the auxiliary variable to perform useful tasks. For instance, the auxiliary variable may be shared between different operations and routed to different destination variables [68]. In other cases, the master and slave variables may be replaced with one edge-triggered register. In the latter case, the sequencer is no longer needed, and the accumulator circuit may be simplified to the configuration in Figure 3.8. A request signal on the activation channel is forwarded to function f and it propagates from there to the read ports of Src and Dst. When the values of Src and Dst have been read and processed, the function unit sends an acknowledge signal to the transferrer module. This signal becomes the request to the write port of variable Dst, and the signal’s rising edge clocks the data into it.

Figure 3.8: Revised accumulator circuit

Simulation results for the 16-bit accumulator test circuit using the configuration in

Figure 3.8 showed a 44% increase in speed, 9% decrease in area and 22% decrease in the


power-delay product compared to the reference circuit of Figure 3.7. In the case of the

multiplier, the circuit based on Figure 3.8 was 16% faster than that based on Figure 3.7

and had 7% less area and power-delay product.

It should be noted that these improvements are possible because edge-triggering is used. In comparison, the early close scheme may be used to improve the performance of the original accumulation configuration of Figure 3.7 by allowing the use of a T-sequencer instead of the S-sequencer. However, it results in a larger circuit because both the auxiliary latch and the sequencer are needed.

3.5 Introducing Concurrency

There are three channels in Figure 3.8, Act, Func and Write. The handshakes associated

with these channels occur sequentially, and hence, the delays involved are additive, as

illustrated in Figure 3.9. The total delay for a single accumulation operation is Fup +

Wup + E + Fdown + Wdown, where E is the delay of the down phase of the environment

that issues the activation signal, Fup is the up phase delay of the handshake on the Func

channel and Wup is the up phase delay of the Write channel. Similarly, Fdown and Wdown

are the down phase delays of the handshakes on channels Func and Write. The speed

of operation can be increased if it is possible to break this sequence of handshakes into

two parts that can be overlapped. This may be achieved by inserting a T-element in one

of the channels, provided, of course, that the integrity of the data transfer operation is

not compromised.

Figure 3.9: Timing diagram for the circuit in Figure 3.8 (signals Act_Req / Func_Req, Func_Ack / Write_Req and Write_Ack / Act_Ack; intervals Fup, Wup, E, Fdown and Wdown)


Figure 3.10 shows the case where the T-element is inserted in the Func channel.

Because this channel carries data, the T-element is augmented by a wired connection

on the data paths resulting in the element shown in Figure 3.11, which will be referred

to as a T-isolator. The operation of the circuit is illustrated by the timing diagram

in Figure 3.12. An activation request results in a request on the Inter channel then

the Func channel after a delay T through the T-isolator. When the function receives

the request and subsequently generates an acknowledgment, the T-isolator removes the

request and, at the same time, sends a request to the destination variable through the

transferrer. Thus, the down phase of the function proceeds in parallel with both the

Write operation at Dst and the completion of the handshake on the activation channel.

We have assumed for simplicity that the T-element has the same delay, T , for both the

up and down phases and for both of its ports.

Figure 3.10: Inserting a T-element in the Func channel

Figure 3.11: The T-isolator

Figure 3.12: Timing diagrams for the circuit in Figure 3.10: (a) when Fdown < (Wup + E); (b) when Fdown > (Wup + E)

The diagram of Figure 3.12a shows the case Fdown < Wup + E. The total delay for one data operation in this case is 3T + Fup + Wup + E + Wdown. When Fdown is greater than Wup + E, Figure 3.12b, the delay is 3T + Fup + Fdown + Wdown. Compared to Figure 3.9, the reduction in the total delay for one data operation is given by:

\[
\text{Reduction in delay} =
\begin{cases}
F_{\mathrm{down}} - 3T & \text{if } F_{\mathrm{down}} < (W_{\mathrm{up}} + E) \\
(W_{\mathrm{up}} + E) - 3T & \text{if } F_{\mathrm{down}} > (W_{\mathrm{up}} + E)
\end{cases}
\tag{3.1}
\]

Equation (3.1) shows that a net reduction in delay will be achieved whenever the delay

resulting from the insertion of the T-element (3T ) is less than both Fdown and Wup + E.

Because of the early removal of the request signal to the function, the configuration in

Figure 3.10 is allowable only in cases in which the function is a combinational circuit that

will not change its output data until the data in either Src or Dst are changed. Variable

Src is controlled by the environment and is assumed not to change until the activation

handshake is completed. Variable Dst will not change until it receives the rising edge of

the write request.

Inserting the T-isolator in the Write channel yields the circuit in Figure 3.13, and

the corresponding timing diagram is given in Figure 3.14. In this case, the down phase of

the Write channel is overlapped with both the environment’s delay and the down phase

of the function. The resulting reduction in delay is given by (3.2).


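\[
\text{Reduction in delay} =
\begin{cases}
W_{\mathrm{down}} - 3T & \text{if } W_{\mathrm{down}} < (E + F_{\mathrm{down}}) \\
(E + F_{\mathrm{down}}) - 3T & \text{if } W_{\mathrm{down}} > (E + F_{\mathrm{down}})
\end{cases}
\tag{3.2}
\]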

Figure 3.13: Inserting a T-element in the Write channel

Figure 3.14: Timing diagrams for the circuit in Figure 3.13: (a) when Wdown < (E + Fdown); (b) when Wdown > (E + Fdown)

Figure 3.15: Inserting a T-element in the Act channel

The third possibility is illustrated in Figure 3.15 and the timing diagram is in Figure 3.16. When an activation signal is received from the environment, one data transfer takes place within the accumulator loop. Then, upon receiving an acknowledge signal on the Mid channel, the T-element removes its request and simultaneously sends an acknowledge signal to the outside environment. Thus, the completion of the handshake on the activation channel proceeds in parallel with the down phase of the handshakes inside the accumulation loop. The reduction in delay in this case is given by (3.3).

\[
\text{Reduction in delay} =
\begin{cases}
E - 3T & \text{if } E < (F_{\mathrm{down}} + W_{\mathrm{down}}) \\
(F_{\mathrm{down}} + W_{\mathrm{down}}) - 3T & \text{if } E > (F_{\mathrm{down}} + W_{\mathrm{down}})
\end{cases}
\tag{3.3}
\]

It should be noted that the same delay (T ) has been used for different transactions

through the T-element to simplify the figures and expressions. The term 3T represents

the sum of all the delays introduced by the T-element during a complete data transfer

operation.
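The three reduction expressions share a common form, which the following short derivation makes explicit. The symbols X, Y and R are introduced here for illustration only: X and Y denote the two delays that a given insertion point allows to overlap (for example, X = Fdown and Y = Wup + E for the Func channel), and R denotes the remaining delays that are not overlapped. Under the same simplification of a uniform T-element delay, the delay before and after insertion is

\[
D_{\text{before}} = R + X + Y, \qquad D_{\text{after}} = R + \max(X, Y) + 3T,
\]

so that

\[
D_{\text{before}} - D_{\text{after}} = X + Y - \max(X, Y) - 3T = \min(X, Y) - 3T .
\]

In each case, therefore, the gain is set by the smaller of the two overlapped delays, and the insertion pays off only when that smaller delay exceeds 3T.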

3.6 System Timing Optimization

The benefit of introducing concurrency as well as the choice of the optimum location for inserting a T-element can be assessed using expressions (3.1)–(3.3). The placements of the T-elements in Figures 3.10, 3.13 and 3.15 differ in the two delays being overlapped, the smaller of which determines the possible gain in speed. Clearly, the optimum location for inserting the T-element is the location where that smaller delay is as large as possible.

The three possible insertion points produce three pairs of overlapped delays as shown in Table 3.1. The optimum insertion point can be readily determined when the values of these delays are known. In most cases, using a T-element to isolate the circuit element having the longest delay yields the optimum or near optimum result, as the examples below illustrate.

Figure 3.16: Timing diagrams for the circuit in Figure 3.15: (a) when E < (Fdown + Wdown); (b) when E > (Fdown + Wdown)

Table 3.1: Overlapped delays for different insertion points

Location Overlapped delays

Func Wup + E || Fdown

Write E + Fdown || Wdown

Act Fdown + Wdown || E

3.6.1 System examples

The proposed methodology has been examined using the bundled-data 16-bit accumu-

lator and the 8-bit by 8-bit multiplier test circuits. First, the accumulator was tested

inside a simple test circuit that generates repeated data requests. Handshake delays in

Balsa implementations depend on the data in the corresponding data channels, and as

such they vary from one transaction to the next. Variability is on the order of ± 5%.

The delay values obtained from simulations for one data transaction in the accumulator

test circuit are:

Fup = 1795 ps , Fdown = 1362 ps

E = 487 ps

Wup = 539 ps , Wdown = 584 ps

T ≃ 200 ps

The total time for the transaction in Figure 3.9 based on these values is 4767 ps. Exam-

ination of the overlapped delays in Table 3.1 shows that the largest possible reduction

in delay is obtained when the T-element is inserted on the Func channel. Using (3.1),

the delay reduction in this case is (Wup + E − 3T ) = 426 ps. The other scenarios of

T-element insertion increase the circuit delay. If a T-element is inserted on the Write

channel, the delay change is (Wdown − 3T ) = −16 ps according to (3.2). Similarly, the


insertion of a T-element in the Act channel changes the delay by (E − 3T ) = −113 ps

according to (3.3).
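As a check on the figures quoted above, the arithmetic can be written out explicitly. This is a worked sketch under two assumptions: that the transaction of Figure 3.9 is the plain sum of the five handshake delays, and that each gain is the smaller overlapped delay of Table 3.1 minus 3T. Both assumptions reproduce the quoted values.

```latex
\begin{align*}
F_{up} + W_{up} + E + F_{down} + W_{down} &= 1795 + 539 + 487 + 1362 + 584 = 4767\ \text{ps} \\
\text{Func: } \min(W_{up} + E,\ F_{down}) - 3T &= \min(1026,\ 1362) - 600 = 426\ \text{ps} \\
\text{Write: } \min(E + F_{down},\ W_{down}) - 3T &= \min(1849,\ 584) - 600 = -16\ \text{ps} \\
\text{Act: } \min(F_{down} + W_{down},\ E) - 3T &= \min(1946,\ 487) - 600 = -113\ \text{ps}
\end{align*}
```

Only the Func insertion yields a positive gain, because it is the only case in which the smaller overlapped delay exceeds the 600 ps overhead of the T-element.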

Simulation results for the case of a T-element inserted on the Func channel showed a

reduction in delay of 362 ps. The difference between the expected and observed delays is

a result of the changes in port loading when the circuit configuration changes. Also, the

value of T used in the analysis is an approximate average for all the T-element delays.

The delays of different transactions on the ports of the T-element are within ± 10 ps of

the 200 ps given above, with the delay from B_Ack+ to B_Req− in Figure 3.3c being

the largest.

Improvements in the overall performance are affected by both the accumulator loop

delay and the delays in its test environment. The time of 4767 ps of the handshake

operation in Figure 3.9 represents about 60% of the delay between successive data trans-

actions. The remainder of the delay is introduced by the test circuit that loads new data

in the Src variable and sends a new activation signal. Insertion of the T-element in the

accumulation loop resulted in an overall increase in speed of about 4%. The accumulator

circuit alone was 50% faster than the original reference circuit of Figure 3.7 and had a

23% lower power-delay product. Also, simulations confirmed that the insertion of the

T-element at locations other than the Func channel resulted in lower performance.

The multiplier was used to test the gain in performance in a larger circuit. There are

three accumulator loops, ACC1, ACC2 and ACC3, inside the multiplier, as shown in

Figure 3.17. The delay values for the accumulators are given in Table 3.2.

Table 3.2: Delay values in ps for the accumulators inside the multiplier

         ACC1   ACC2   ACC3
Fup      1292   2340    457
Fdown     917   1786    686
E        1699   2173    406
Wup       706   2110   1520
Wdown     798   1572   1325
T         200    200    200


Figure 3.17: Accumulation loops inside the multiplier

The optimum place to insert a T-element for each accumulator and its environment

can be determined using Table 3.1. Because of the long delays in the environment of

ACC1 and ACC2, the optimum insertion points are on the activation channels Act1 and

Act2. The resulting reduction in delay is 1099 ps (20%) for ACC1 and 1573 ps (16%)

for ACC2.

The best place to insert a T-element for ACC3 is on the Write3 channel. However,

this accumulator is connected to port 1 of the T-sequencer, which means that its operation

is already overlapped with its environment. According to Figure 3.2, the T-sequencer

overlaps the down phase of Act3 with both the down phase of Act1 and the delay of

the subsystem issuing ACT. Thus, the down phase delay of Act3 is overlapped with a larger delay and decreasing it is not useful. Furthermore, inserting a T-element in channel Write3 increases the up phase delay of Act3 by 2T, thus reducing the overall

performance.
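The selection of insertion points for the three accumulators can be reproduced with a short script. The following is a minimal sketch, not part of the original tool flow: the names `TABLE_3_2`, `gains` and `best_insertion` are illustrative, and each gain is modelled as the smaller overlapped delay of Table 3.1 minus 3T. The model considers each loop in isolation, so it does not capture the T-sequencer overlap that makes the Write3 insertion counterproductive for ACC3.

```python
# Delay values in ps copied from Table 3.2 (T is the T-element delay).
TABLE_3_2 = {
    "ACC1": dict(Fup=1292, Fdown=917, E=1699, Wup=706, Wdown=798, T=200),
    "ACC2": dict(Fup=2340, Fdown=1786, E=2173, Wup=2110, Wdown=1572, T=200),
    "ACC3": dict(Fup=457, Fdown=686, E=406, Wup=1520, Wdown=1325, T=200),
}


def gains(d):
    """Estimated delay reduction (ps) for each insertion point of Table 3.1."""
    overhead = 3 * d["T"]  # total delay added by the T-element
    return {
        "Func": min(d["Wup"] + d["E"], d["Fdown"]) - overhead,
        "Write": min(d["E"] + d["Fdown"], d["Wdown"]) - overhead,
        "Act": min(d["Fdown"] + d["Wdown"], d["E"]) - overhead,
    }


def best_insertion(d):
    """Return (location, gain in ps, gain as a fraction of the loop delay)."""
    g = gains(d)
    location = max(g, key=g.get)
    loop_delay = d["Fup"] + d["Wup"] + d["E"] + d["Fdown"] + d["Wdown"]
    return location, g[location], g[location] / loop_delay


if __name__ == "__main__":
    for name, delays in TABLE_3_2.items():
        location, gain, fraction = best_insertion(delays)
        print(f"{name}: {location}, {gain} ps ({fraction:.0%})")
```

Under these assumptions the script selects the Act channel for ACC1 and ACC2 and the Write channel for ACC3, matching the choices discussed above.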

Insertion of T-elements on the activation channels of ACC1 and ACC2 resulted in a

10% speed improvement, at the expense of a slightly higher power-delay product. As in


the case of the simple accumulator, insertion of the T-elements at other locations resulted

in lower performance.

3.7 Experimental Methodology and Results

The synthesis and simulation flow used in testing the circuits presented in this chapter

is shown in Figure 3.18. Balsa was used to design and synthesize the test circuits, and

technology-dependent optimizations were performed using Synopsys. First, circuits were

described using the Balsa language and tested for functionality using Balsa’s simulation

tool. Then, Balsa synthesized the description into a netlist targeting one of the technologies it supports. Balsa’s output is a technology-dependent Verilog netlist, which was converted to a generic Verilog file using simple scripts. The generic file was fed to Synopsys Design Compiler along with appropriate scripts and a Synopsys design constraints (SDC) file that prevents the tool from omitting delay elements. Design Compiler synthesized the generic Verilog netlist to the

desired technology and optimized the circuit for speed and area.

Synopsys Design Compiler also creates a delay file in the standard delay format (SDF).

The delay file created by Design Compiler was fed to the simulator, ModelSim, along with

the gate-level netlist for post-synthesis simulations. ModelSim was also used to record

the switching activity of the circuit in the switching activity interchange format (SAIF).

Then, the netlist, SAIF file and technology libraries were fed to Synopsys Power Compiler

for average power analysis.

Table 3.3 summarizes the post-synthesis simulation results for various implementa-

tions of the circuits described earlier in the chapter — a 16-bit 5-stage FIFO, a 16-bit

accumulator and an 8-bit by 8-bit radix-4 Booth multiplier. They were all implemented in

180-nm TSMC technology. The Booth multiplier implementation code in Balsa is shown

as an example in Appendix B. The three accumulators in the multiplier implementation

correspond to the three accumulation statements, which are identified by comments in

the code.


Figure 3.18: Design flow


For each test circuit, a reference configuration was chosen, as described before. The

average speed and power were obtained by applying the same 100 randomly generated data values to all implementations. The power-delay product was calculated by

multiplying the average power of the system at the highest performance by the average

delay between two consecutive inputs. Thus, the power-delay product obtained was the

average required energy for processing each input datum.

Table 3.3: Simulation results

                                                                      Relative  Relative  Relative
                                                                                average   power-delay
Circuit       Name   Description                                      area      speed     product
FIFO          FIFO1  Using transparent latches                        1         1         1
              FIFO2  Using edge-triggered flip-flops                  1.524     2.305     1.035
Accumulator   ACC1   Balsa implementation, using master-slave         1         1         1
                     latches (Figure 3.7)
              ACC2   Using edge-triggering as in Figure 3.8           0.912     1.439     0.782
              ACC3   ACC2 with optimum insertion of T-element         0.922     1.502     0.774
Radix-4       MUL1   Using master-slave latches as in Figure 3.7      1         1         1
Booth         MUL2   Using edge-triggering as in Figure 3.8           0.935     1.160     0.936
Multiplier    MUL3   MUL2 with optimum insertion of T-elements        0.945     1.274     0.954
              MUL4   MUL2 with non-optimum insertion of T-elements    0.945     1.235     0.959

The results in Table 3.3 show the trade-offs possible between latch-based and edge-triggered

implementations of WAR operations. In the case of a FIFO, edge-triggered flip-flops pro-

vide a much faster circuit at the expense of a larger area and more power. For accumulation

statements, edge-triggering wins over latch-based implementation in speed, area and energy

consumption. In handshake circuits with accumulation statements, it is possible to increase

concurrency and speed further by inserting T-elements at the optimum places found using Ta-

ble 3.1, as exemplified by ACC3 and MUL3. Multiplier MUL3 is 27% faster than the reference

circuit and has 5% less area and power-delay product. Multiplier MUL4 demonstrates the case

where T-elements are inserted in non-optimum places, Func1 and Func2, and as a result it is


slower than MUL3.
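The trade-offs in Table 3.3 can be explored numerically. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that relative speed is the inverse of relative average delay, so that the relative average power implied by the table is the relative power-delay product multiplied by the relative speed.

```python
# (relative area, relative average speed, relative power-delay product)
# copied from Table 3.3; each circuit family is normalized to its reference.
TABLE_3_3 = {
    "FIFO1": (1.0, 1.0, 1.0),
    "FIFO2": (1.524, 2.305, 1.035),
    "ACC1": (1.0, 1.0, 1.0),
    "ACC2": (0.912, 1.439, 0.782),
    "ACC3": (0.922, 1.502, 0.774),
    "MUL1": (1.0, 1.0, 1.0),
    "MUL2": (0.935, 1.160, 0.936),
    "MUL3": (0.945, 1.274, 0.954),
    "MUL4": (0.945, 1.235, 0.959),
}


def relative_power(name):
    """Relative average power implied by Table 3.3.

    Assumes relative delay = 1 / relative speed, so that
    PDP = power * delay  =>  power = PDP * speed.
    """
    _, speed, pdp = TABLE_3_3[name]
    return pdp * speed


def speedup_over(name, baseline):
    """Speed of `name` relative to `baseline` (same circuit family)."""
    return TABLE_3_3[name][1] / TABLE_3_3[baseline][1]


if __name__ == "__main__":
    print(f"FIFO2 relative power: {relative_power('FIFO2'):.2f}")
    print(f"ACC3 over ACC2: {speedup_over('ACC3', 'ACC2'):.3f}")
    print(f"MUL3 over MUL2: {speedup_over('MUL3', 'MUL2'):.3f}")
```

Under this assumption the edge-triggered FIFO draws roughly 2.4 times the power of the latch-based one while its energy per datum is almost unchanged, and the T-element insertions account for speed gains of about 4% in the accumulator and 10% in the multiplier.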

In order to generalize the results from the specific 8-bit and 16-bit examples, the sizes of S-

and T-elements and different types of latches and edge-triggered flip-flops are given in Table 3.4

for the technology used in the experiments. The storage elements used in the experiments are

highlighted in the table.

Table 3.4: Area of storage and control elements

Circuit                                 Area (µm²)
S-element (Two-step S-sequencer)        81.31
T-element (Two-step T-sequencer)        52.80
Active-low enable latch                 42.28
Resettable active-low-enable latch      57.66
Resettable negative-edge flip-flop      73.04
Positive-edge flip-flop                 57.66
Resettable positive-edge flip-flop      73.04

Balsa normally uses active-low-enable latches, and generates the control circuit correspond-

ingly. Hence, negative-edge-triggered flip-flops were used to replace the latches in order to keep

the control logic intact. The negative-edge-triggered flip-flop provided in the standard cell li-

brary used in this chapter is resettable. However, in most of our test circuits, the reset of the

flip-flop is tied high as it is not needed.

The increase in area in the edge-triggered implementation of the FIFO can be explained

by the data in Table 3.4. In the FIFO circuit, 16 non-resettable latches were replaced with 16

resettable flip-flops, and three 2-step S-sequencers were replaced with three 2-step T-sequencers

(see Figure 3.6). Thus, the total area was increased. In the case of the accumulator, the storage

elements had to be resettable to clear the accumulator initially. Therefore, 32 resettable latches

were replaced with 16 resettable flip-flops, and a two-step S-sequencer was replaced with a T-isolator (see Figure 3.8 and Figure 3.10), which has the same size as a T-element. Since the area of two resettable latches is larger than the area of a single resettable flip-flop, the area of the circuit is reduced. The same type of analysis applies to the multiplier circuit, in which each pair of non-resettable latches in the accumulation loops was replaced with a single flip-flop.
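The direction of the area changes can be checked against Table 3.4. The following is a rough sketch using the cell counts quoted in the text. It covers only the replaced storage and sequencing cells, so it is not expected to reproduce the relative areas of Table 3.3; the dictionary keys are illustrative names.

```python
# Cell areas in square micrometres, copied from Table 3.4.
AREA = {
    "s_element": 81.31,
    "t_element": 52.80,
    "latch": 42.28,
    "latch_reset": 57.66,
    "ff_neg_reset": 73.04,
}


def area(cells):
    """Total area of a {cell name: count} mapping."""
    return sum(AREA[name] * count for name, count in cells.items())


# Replacements described in the text (storage and sequencing cells only).
fifo_before = {"latch": 16, "s_element": 3}
fifo_after = {"ff_neg_reset": 16, "t_element": 3}
acc_before = {"latch_reset": 32, "s_element": 1}
# The T-isolator is taken to have the area of a T-element, as stated above.
acc_after = {"ff_neg_reset": 16, "t_element": 1}

fifo_delta = area(fifo_after) - area(fifo_before)
acc_delta = area(acc_after) - area(acc_before)

if __name__ == "__main__":
    print(f"FIFO change: {fifo_delta:+.2f} um^2")
    print(f"Accumulator change: {acc_delta:+.2f} um^2")
```

With these counts the FIFO cells grow by about 407 µm², while the accumulator cells shrink by about 705 µm², which agrees with the direction of the area changes reported for the two circuits.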

A resettable flip-flop is significantly larger than a non-resettable flip-flop, as shown in the last two rows of Table 3.4. Smaller and thus lower-power circuits can be obtained


by using a standard cell library featuring non-resettable negative-edge flip-flops or by modifying

Balsa to generate control signals for positive-edge flip-flops. Another possible improvement is

to use faster C-elements. The C-elements in the architecture of T-elements were implemented

by logic gates as the test circuits were synthesized using standard cells available in the library.

Hence, better results would be expected if the T-element or C-element is part of the library

or if a custom layout design is used.

3.8 Conclusion

This chapter demonstrated that the use of edge-triggering in the synthesis of handshake circuits

is an effective means for avoiding the write-after-read hazard and offering a range of trade-offs

among speed, area and power. Significant speed improvements in three different test circuits

have been demonstrated. With edge-triggering it becomes possible to use T-elements to intro-

duce concurrency in the circuit’s operation, thus increasing speed further. The speed of a simple

16-bit accumulator circuit increased by 50%, accompanied by a reduction in the power-delay

product of 23%.

The introduction of concurrency using T-elements leads to a significant reduction in the

penalty associated with the down phase of 4-phase handshake circuits. Criteria for optimized

placement of the T-elements have been proposed and tested in a multiplier circuit implemented

using Balsa and Synopsys. In all cases, the proposed circuits are compatible with the require-

ments of syntax-directed compilation.

Chapter 4

Enhanced Synchronous Design

4.1 Introduction

Asynchronous circuits have unique features that can resolve or diminish many of the urgent

problems of today’s applications in the deep nanometer regime. Compared to synchronous

design, these features include ability to adapt to process and environmental variations, potential

average-case rather than worst-case performance, lower noise and lower power consumption.

However, these advantages are not readily available to designers because of the difficulties

involved in the design of fully asynchronous systems.

Asynchronous circuits depend on fine-grained handshakes for timing and the overhead of

the control circuits to implement handshakes is significant. That is, in many cases, the con-

trol circuits become the bottleneck of the system, resulting in a degraded performance. The

examples in Chapter 3 demonstrate the overhead involved.

Other asynchronous methods such as desynchronization, though easier to design, also suffer

from asynchronous control overhead. In the desynchronization method, a master-slave configu-

ration replaces edge-triggered flip-flops to allow fully asynchronous operation between pipeline

stages. However, the delay between the fall of the slave enable signal and the rise of the master

enable signal results in a 20% performance overhead in a DLX microprocessor [46]. Also, the

desynchronized processor is 13.44% larger than the synchronous design and consumes more

power, especially leakage power.


A synchronous circuit depends on a single control signal, the clock, for the timing of its

operation. This often results in lower control overhead compared to asynchronous circuits.

However, in most synchronous systems, the clock signal has a fixed frequency determined by

the worst-case process-voltage-temperature (PVT) analysis of the most critical path. Hence,

performance is limited by worst-case parameters, even though the system may be operating

under more favorable conditions most of the time.

This thesis proposes a hybrid approach that combines the best features of both synchronous

and asynchronous systems. It is shown that the clock signal can be controlled dynamically using

asynchronous logic. While the main core of the system remains synchronous, the controlling

circuit that generates the clock signal is asynchronous logic. Thus, many beneficial features of

asynchronous design are brought to the synchronous environment. At the same time, the ease of

design and low control overhead of synchronous systems are retained. The resulting system will

be referred to as a PVT-aware self-tuning system. It is able to tune the timing of its operations

to produce the best-possible results under the prevailing PVT conditions. Accordingly, the

design approach is referred to as PVT-aware self-tuning design.

Although the proposed architecture benefits from its asynchronous nature, the use of asyn-

chronous design is limited to the clock generation circuit, with the rest of the system being a

synchronous circuit that can be designed, synthesized, and laid out using a well-established and

understood design flow. As will be shown, the whole system, including its asynchronous clock

generation part, can be implemented using conventional tools and standard cells. It does not

involve any asynchronous design overhead, except for a pre-designed clock generation circuit.

This chapter provides a high-level overview of the proposed PVT-aware design. Detailed

implementations, and the corresponding design flows are presented in Chapter 5 and Chap-

ter 6. Chapter 5 shows how the PVT-aware design approach can be used to reduce the power

consumption of a high-speed circuit. Chapter 6 builds on the proposed PVT-aware design

and adjusts the clock frequency of a pipelined system according to the operations taking place in the pipeline, in addition to the PVT conditions. The objective is to increase the speed without

significantly increasing area and energy consumption.


4.2 PVT-aware Design Approach

The key component of a PVT-aware self-tuning system is its clock generation circuit, which is

shown in Figure 4.1. The chip area is divided into multiple regions and a PVT-aware completion

detection circuit is included in each region. After receiving a clock pulse from the clock pulse

generator at the center, the completion detection circuit introduces a delay that matches the

delay of the critical path of the system, then sends a completion signal to the clock pulse

generator. The clock pulse generator waits until it has received all completion signals before issuing a new clock pulse. Thus, the period of each clock cycle is determined by the longest delay introduced by the completion detection circuits.

Figure 4.1: Clock generation design (CD ≡ Completion Detector, CPG ≡ Clock Pulse Generator)

The clock generation circuit is redrawn in Figure 4.2 for clarity. A completion detection circuit together with the clock pulse generator forms a loop whose delay should match the delay of the critical path of its region. The delay of the loop is tuned using a static timing analysis (STA) tool.

Figure 4.2: Clock generation loops (CD ≡ Completion Detector, CPG ≡ Clock Pulse Generator)

Multiple completion detection circuits are placed at different regions, which may be subject

to different temperatures and voltages at different times. Also, the fabrication process may

result in parameter variations from one area to another. As PVT conditions change, the delay

of the loop (composed of the completion detection circuit and the clock pulse generator) tracks

the changes in the delay of the corresponding critical path. Since the clock generation circuit

waits for all completion signals, the loop having the longest delay in each clock cycle determines

the clock period.
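The effect of letting the slowest loop set each clock period can be illustrated with a small numerical model. This is a sketch under stated assumptions, not the circuit itself: the nominal loop delays, the range of the per-region scale factors and the worst-case factor are all hypothetical values chosen for illustration, and each scale factor stands in for the local PVT conditions of one region during one cycle.

```python
import random


def clock_periods(nominal_ps, scale_per_cycle):
    """Clock period (ps) for each cycle of a self-tuning clock generator.

    nominal_ps      -- nominal loop delay of each region (ps)
    scale_per_cycle -- for every cycle, one delay scale factor per region
                       that stands in for the local PVT conditions
    """
    periods = []
    for scales in scale_per_cycle:
        loop_delays = [d * s for d, s in zip(nominal_ps, scales)]
        # The clock pulse generator waits for every completion signal,
        # so the slowest loop sets the period of this cycle.
        periods.append(max(loop_delays))
    return periods


if __name__ == "__main__":
    rng = random.Random(0)
    nominal = [1000.0, 1000.0, 1000.0, 1000.0]  # hypothetical, four regions
    worst_case_scale = 1.5                      # hypothetical worst corner
    scales = [[rng.uniform(0.9, 1.2) for _ in nominal] for _ in range(1000)]
    periods = clock_periods(nominal, scales)
    print("mean self-tuned period (ps):", sum(periods) / len(periods))
    print("fixed worst-case period (ps):", max(nominal) * worst_case_scale)
```

In this model a fixed clock must always use the worst-case period, whereas the self-tuned period follows whichever region is slowest in each cycle.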

4.3 Solving Real-world Problems

As technology scales down to smaller feature sizes, it becomes more difficult to predict the

quality of the fabricated chips. In addition to process variations, voltage and temperature

variations should also be accounted for during design. In the traditional multi-corner static

timing analysis approach, many PVT corners, including the worst-case PVT corner, are used to test the timing of the design and to ensure a reasonable chip yield. Because this approach designs for the worst-case scenario, the resulting chips are large, power-hungry and slow. Another design approach is to use statistical static timing analysis (SSTA), which, unlike traditional static timing analysis, produces timing results as a function of yield, so that the designer can trade off performance for yield [69]. However, SSTA is very complicated and still immature [70].

The main advantage of the proposed technique is its adjustability to PVT variations. It

features an on-chip clock generation which is subject to the same fabrication process as the rest

of the chip. During run-time, the on-chip clock generation circuit synthesizes a clock period

suiting the quality of the fabrication process and the prevailing voltage and temperature. This

mitigates the process variations problem. That is, the resulting chip tunes itself to produce

the best-possible results, given the inter-chip (die-to-die) and intra-chip (within-die) variations.

Because of this self-tuning mechanism, the designer does not need to worry as much about

delays and variations. Hence, complicated analyses such as SSTA are less necessary for dealing with variations.

When a design is optimized and implemented, different constraints are used to guide op-

timization tools for trade-offs in speed, power and area. These optimizations are usually in

conflict with each other. As shown in Figure 4.3 [71], designing and optimizing for higher clock

frequencies results in larger and more complex circuits, which consume more power. Particularly,

in deep nanometer technologies, high frequency requirements push the optimization tool to use

a large number of high-speed high-leakage cells, resulting in an increase in leakage power.

Most synchronous design flows are limited by the assumption of using a fixed clock frequency,

which is determined using the worst-case PVT conditions. The PVT-aware self-tuning design

exploits the fact that a system is subject to typical rather than worst-case PVT conditions

most of the time. Since the resulting system is PVT-aware, it can be implemented to be as fast

as its traditionally designed counterpart under typical conditions, but with significantly lower

power consumption and lower area as explained in Chapter 5. Alternatively, the PVT-aware

self-tuning mechanism can be used to implement a system that is faster than its traditional

fixed-clock counterpart under typical conditions as explained in Chapter 6.

According to Figure 4.3, at the extreme end with high-frequencies, power reduction is desired

and at mid-range clock frequencies, achieving higher throughputs without increasing area and

energy consumption is desired. Chapter 5 focuses on power reduction at the extreme end of the

design space with high clock frequencies and Chapter 6 focuses on improving the performance

of the systems with mid-range clock frequencies.


Figure 4.3: Design Space

Chapter 6 also adds a new feature to the proposed PVT-aware design approach. It demon-

strates a design technique to implement pipelines that automatically adjust their clock period

according to the operations currently performed in the pipeline. Therefore, the speed of the

system is not limited by the delay of the slowest possible operation.

Chapter 5

Leakage Reduction

5.1 Introduction

Power reduction is an important objective in the design of today’s high-performance systems,

particularly in portable devices. Meeting power requirements is getting more difficult in to-

day’s shrinking technologies because of leakage power. As technology scales to smaller feature

sizes, leakage power becomes a more substantial portion of the total power, due to two main

factors. First, the gate length and threshold voltage of transistors are reduced, resulting in a

substantial increase in the leakage power [14]. Secondly, process variations push the designers

to use more conservative delay estimations, which result in overly complex and leaky circuits.

Process variations make the quality of fabricated chips less predictable and hence, more and

more conservative delay and clock frequency estimations are used [72]. This is an undesired

over-engineering to ensure that a large percentage of the fabricated chips meet performance

requirements. For example, using multi-corner static timing analysis, a digital system is de-

signed to deliver the required performance under all PVT corners, including worst-case PVT.

To this end, many high-speed high-leakage cells are used, resulting in significant leakage power

consumption.

This chapter demonstrates that PVT-aware systems can be designed to have significantly

reduced leakage power consumption. The PVT-aware self-tuning mechanism presented in Chap-

ter 4 is used here to introduce a new low-power design approach. According to the proposed


approach, a digital system is designed to meet the required clock frequency under typical rather

than worst-case PVT conditions. The system adjusts its clock frequency automatically as PVT

conditions change, either inter-chip or intra-chip. To implement PVT-aware systems, the on-

chip clock generation circuit proposed in Chapter 4 is used. This chapter presents the gate-level

design of the clock generation circuit along with a simple standard-cell ASIC design flow to im-

plement PVT-aware systems. Systems equipped with the proposed self-tuning circuitry are

designed with less pessimistic assumptions and over-engineering. Hence, they are simpler sys-

tems that meet the timing requirements with a smaller number of high-leakage cells and thus,

significantly reduced leakage. In a case study of a DLX microprocessor, leakage power is re-

duced by 10X under typical PVT conditions and by 7X under worst-case PVT conditions using

the proposed approach. Other advantages include a reduction in dynamic power, resilience to

PVT variations, and suitability for voltage scaling.

Most of the previously proposed PVT-aware approaches focus on improving performance [3,

7,53,73] or on reducing dynamic power consumption [8]. The approach presented here focuses

on reducing leakage power consumption while retaining the required performance under typical

conditions. Comparison to previous work is presented in Section 5.9.

In subsequent sections, opportunities to reduce power are demonstrated and the PVT-aware

architecture is described, followed by a methodology to integrate the proposed architecture

into conventional digital design flows. As a case study, the proposed design methodology is

demonstrated using a free-license DLX microprocessor, and complete post-layout results in

90nm technology are presented.

5.2 Review of Power Management Techniques

In today’s technology nodes, leakage power is a significant contributor to the total power, as

the gate length and threshold voltage are scaled down. Several techniques can be applied at the

circuit level to reduce leakage power, including multi-threshold libraries, multiple and dynamic

supply voltages, power gating and variable body biasing [14,74].

Multi-threshold libraries feature different implementations for functions, including high-


voltage-threshold (HVT), standard-voltage-threshold (SVT) and low-voltage-threshold (LVT)

cells, which have different speed and leakage characteristics. Of them, the LVT cells are the

fastest and have the highest leakage. They are used by the synthesis and optimization tools in

critical paths. The SVT and HVT gates are used in less-critical paths to reduce leakage power.

Power gating and body biasing techniques may also be used to control leakage dynamically

in sections of the chip that are idle. An adaptive body biasing approach is presented in [75].

The chip area is divided into small blocks; each block has a replica of the critical path. The

delay of the replica is used as an indicator for body-biasing the transistors in that block. The

use of replicas leads to an excessive increase in area.

Dynamic power depends on the switching activity of the circuit. An effective way to reduce

dynamic power is to gate the clock input to the sections of the circuit that are not performing

useful tasks [15]. Clock gating introduces extra area overhead, and thus, the granularity of

clock-gated blocks has to be selected carefully to avoid a large increase in the leakage power.

Dynamic voltage scaling is another technique that is used in many systems such as laptops

to deliver high throughput when required by increasing the input voltage to the system. During

idle periods, the input voltage is reduced to extend battery life [76]. The Razor project [8] uses

a PVT-aware mechanism to reduce dynamic power consumption by dynamic voltage scaling.

The approach described in this chapter reduces the leakage power and the dynamic power

during both active and sleep modes of the circuit. It can be combined with other power reduction

techniques to improve power properties further as the examples described later demonstrate.

The resulting PVT-aware system is amenable to dynamic voltage scaling as it automatically

adjusts the operating frequency to the input voltage level.

5.3 Proposed Idea

An IC foundry characterizes its technology under different PVT conditions, known as PVT

corners. The PVT corners for the 90nm technology used in the experiments are given in

Table 5.1 for a 1.0 V supply voltage. Unless otherwise stated, the conditions in Table 5.1 are

those referenced throughout this chapter.


Table 5.1: PVT corners

PVT corner   Process   Voltage (V)   Temperature
Best         Fast      1.1           -40°C
Typical      Typical   1.0           25°C
Worst        Slow      0.9           125°C

For worst-case design, the best PVT corner is used for hold time check and the worst PVT

corner is used for setup time check. Critical paths are identified under the worst PVT corner.

Then, the synthesis and physical design tools are instructed to optimize the design to meet

the required performance under the worst-case conditions. Meeting timing requirements under

the worst PVT conditions is harder than meeting those requirements under typical conditions,

because circuit paths have longer delays.

The test vehicle used in this chapter is Hennessy and Patterson’s 32-bit DLX pipelined

microprocessor [77] downloaded from opencores.org [78]. To illustrate the potential for design

optimization, the DLX processor was synthesized by Synopsys Design Compiler for a clock

frequency of 1 GHz under the worst PVT conditions (Design 1) and also under the typical PVT

conditions (Design 2). Both designs were constrained for the best area and power optimizations.

The design methodology used will be described in Sections 5.5 and 5.6.

Table 5.2: Post-synthesis power breakdown and area of the designs under typical PVT conditions, temp = 25°C

           Average leakage power (mW) and number of cells                  Average dynamic   Area (µm²)
Design     HVT             SVT            LVT            Total             power (mW)
Design 1   0.006           0.026          1.347          1.379             34.672            115215.07
           (3641 cells)    (2377 cells)   (7133 cells)   (13151 cells)
Design 2   0.023           0              0              0.023             33.977            109930.12
           (12711 cells)                                 (12711 cells)

The resulting number of cells used in the two designs from each of the three categories of

low, standard and high threshold cells is shown in Table 5.2. The optimization tool had to use a

mix of all three cell types in Design 1 to meet the performance constraint. For Design 2, it was

able to achieve the desired performance using HVT cells only. As a result, the leakage power

of Design 1 (1.379 mW) is about 60 times that of Design 2 (0.023 mW). Its dynamic power is also higher, as it is a larger circuit with more switching capacitance.


It should be noted that the power values in the table were obtained from an initial power

analysis to evaluate the two designs. Post-layout simulation-based power analysis will be pre-

sented in Section 5.6. The power values were obtained using the typical PVT corner, which

uses a temperature of 25°C. Leakage power increases exponentially with temperature and, at higher temperatures, becomes a more substantial portion of the total power. This is addressed in the following sections.

The lower power consumption of Design 2 motivated the idea to design a system that has

the desired performance level under typical PVT conditions, but is equipped with a PVT-

aware mechanism that adjusts the run-time speed to accommodate changes in PVT conditions.

The architecture to support the PVT-aware mechanism and the corresponding design flow are

explained next.

5.4 The PVT-aware Architecture

As explained in Chapter 4, to create a PVT-aware system, an on-chip clock generation circuit is

added as shown in Figure 5.1. The chip area is divided into multiple regions and a PVT-aware

completion detection circuit is included in each region. The number of regions depends on many

parameters such as the quality of the fabrication process and the size of the design.

After receiving a clock pulse from the clock pulse generator at the center, the completion

detection circuit introduces a delay that matches the delay of the critical path of the system, then

sends a completion signal to the clock pulse generator. When all completion signals are received,

the clock pulse generator generates a new clock pulse.

Figure 5.2 shows a schematic of the clock generation circuit, which is based on the two-

phase single-rail asynchronous design style [2, 12] and Dean’s dynamic clocking approach [7].

The completion detection circuit for each stage comprises a delay element and a toggle. When

the clock pulse emerges from the delay element, it is converted to a level by the toggle before it

is sent back to the C-element of the clock pulse generator. Initially, all toggle elements are reset

and so is the output of the C-element. After the reset is removed, all toggle elements change

state, causing the C-element to toggle its output, thus creating a clock pulse of width CPW at the output of the XOR.

Figure 5.1: Clock generation circuit, CD ≡ Completion Detector, CPG ≡ Clock Pulse Generator. [Diagram: a central CPG sends the clock pulse to CD1 to CD4, one in each of Regions 1 to 4, and receives completion signals 1 to 4 in return.]

Figure 5.2: Clock generation circuit schematic. [Diagram labels: Clock Pulse Generator; Completion Detection Circuit for Region 1; C; Delay; Toggle; flip-flop pins D, Q, SET, CLR; Reset; CPW; completion signals from Region 1 and Region 2; To Clock Tree.]

Each completion detection circuit delays the clock pulse by an amount matching the critical

path of the system under the prevailing PVT conditions in its region, then the toggle changes

state. When all completion detection circuits have toggled, the C-element toggles, creating a new

clock pulse.
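The behaviour of this loop can be illustrated with a small timing model. The following Python sketch is an abstraction under stated assumptions, not the circuit itself: the function name, the regional delay values and the lumped loop overhead are hypothetical and are not taken from the thesis. It shows that, because the C-element waits for all completion signals, each clock period is set by the slowest region.

```python
def next_clock_period(region_delays_ns, loop_overhead_ns=0.0):
    """Return the clock period produced by one pass around the loop.

    region_delays_ns: matched delay of each completion detection circuit
        under the PVT conditions prevailing in its region (hypothetical values).
    loop_overhead_ns: assumed delay of the toggle, C-element and pulse
        generator, lumped into one number for illustration only.
    """
    if not region_delays_ns:
        raise ValueError("at least one completion detection circuit is required")
    # The C-element changes state only after every toggle has changed state,
    # so the slowest region decides when the next pulse is generated.
    return max(region_delays_ns) + loop_overhead_ns


# Four regions; region 3 is assumed to be hotter and therefore slower.
delays = [1.30, 1.32, 1.41, 1.31]
period = next_clock_period(delays, loop_overhead_ns=0.05)
print(f"clock period = {period:.2f} ns")
```

In this model a local slowdown in any one region lengthens the period for the whole chip, which is the property examined for intra-chip variations later in the chapter.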

The PVT-aware architecture results in a variable clock period. Hence, special attention

should be paid to how it communicates with its environment to ensure correct data transfers.

The problem of transferring data between unsynchronized clock domains already exists in many

high-speed systems. As such, many approaches have been suggested to minimize metastability

and data loss when different clock domains are connected. They include multi-flop synchroniz-

ers, multiplexer recirculation techniques, use of first-in-first-out buffers between different clock

domains and handshake techniques [79,80]. Similar synchronization techniques may be applied

for inter-chip and intra-chip data transfers between a PVT-aware system and its environment.

5.5 Design Flow

Figure 5.3 shows the design flow to integrate the suggested PVT-aware architecture into a

conventional standard-cell ASIC design flow [81]. A few extra steps are needed to add the

clock generation circuit to the top-level hardware description language (HDL), place the clock

generation elements appropriately and tune the delay elements.

The most important change from a conventional design flow is that the main core is syn-

thesized and laid out to meet the required clock period under typical rather than worst-case

PVT conditions. Thus, typical-case timing libraries are used for setup check and critical path

analysis during synthesis and layout, leading to much more favorable results.

5.5.1 Clock generation issues

Figure 5.3: Proposed low-power PVT-aware design flow. HDL ≡ Hardware Description Language, DRC ≡ Design Rule Check, CTS ≡ Clock Tree Synthesis, SI ≡ Signal Integrity, STA ≡ Static Timing Analysis, ECO ≡ Engineering Change Order. [Flowchart. Box labels: Start; HDL design of the main core and functional verifications; Synthesize the main synchronous core (optimize for speed, power and area); Decide on the number of PVT regions/implement the circuit of Fig. 2 (HDL code/synthesize into target tech.); Connect the clock of the main core to the output clock of the clock generation circuit in the top-level HDL; IO placement/power planning/floorplanning; Pre-place the clock generation circuit as in Fig. 1/apply set_dont_touch on clock gen.; Placement with in-place DRC/setup/hold/leakage optimizations; Pre-CTS DRC/setup/hold/leakage optimizations; Clock tree synthesis; Post-CTS DRC/setup/hold/leakage optimizations; Timing-driven (high-effort), SI-driven (normal-effort) routing with Fix Antenna feature; Post-route DRC/setup/hold/leakage optimizations; Last pass leakage optimization; Add fillers and check the design (geometry/connectivity/antenna); Check the timing with STA; decision "Is the resulting clock period OK?" (Yes/No); Tune delays and do ECO; Functional verification of the circuit/post-layout simulations/tests; decision "Is the system fully functional?" (Yes/No); Done.]

• The first step to implement the suggested clock generation circuit of Figure 5.2 is to create a library of delay elements in the target technology. A delay element can be implemented as a chain of 2n inverters, where n = 1, 2, ..., N. Then, the delay of each delay element is estimated using a static timing analysis (STA) tool. The result is a table of several delay elements and their corresponding delay values that can be used in the clock generation circuit.

• Multiple completion detection circuits are implemented to match the delay of the critical

path. All of the completion detection circuits are designed to match the same critical path

delay. However, they are placed in different regions of the chip as shown in Figure 5.1, to

follow the regional operating PVT conditions.

• The delay element in Figure 5.2 has to be adjusted such that the delay of the loop

composed of the completion detection circuit and the clock pulse generator is equal to the

critical path of the system. The delay of the loop must be tested under different PVT

corners to ensure that it matches the critical path under all conditions. When adjusting

the delay elements, appropriate margins should be used, because different factors such as

crosstalk, inductance, IR drops, noise, etc. may affect the completion detection circuits

and the datapath elements differently.

• During synthesis, it is sufficient to insert delay elements that are approximately 25%

longer than the desired clock period. They are trimmed later, during the layout flow.

• The submodules of the clock generation circuit should be pre-placed during floorplanning

to avoid a random placement.

• After the layout is completed, the post-layout netlist, the standard delay format (SDF)

file and the standard parasitic exchange format (SPEF) file are exported to an STA tool

to test the delays.

• The loop of Figure 5.2 is examined for each completion detection circuit, using an STA

tool to check if the resulting clock period is appropriate.

• If the resulting clock period is not appropriate, the delays inside completion detection

circuits are tuned and a pass of engineering change order (ECO) is performed to fix the

layout.


• The clock pulse width determined by CPW in Figure 5.2 is tested under all PVT conditions

to ensure that the pulse width requirements of sequential elements are not violated.

• The reset signal to the system must be long enough to ensure that the delay elements get

successfully reset and all the gates and flip-flops become stable.
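The first and fourth points above (a library of delay elements and a starting delay about 25% longer than the desired clock period) can be sketched as follows. This is a minimal illustration, not the flow used in the thesis: the function names and the single per-inverter delay of 0.025 ns are assumptions, whereas a real flow would take each element's delay from an STA run.

```python
def build_delay_library(inverter_delay_ns, max_pairs):
    """Map the number of inverters (2n, n = 1..N) to an estimated delay.

    A single assumed per-inverter delay stands in for the STA-derived
    values that the design flow would actually use.
    """
    return {2 * n: 2 * n * inverter_delay_ns for n in range(1, max_pairs + 1)}


def pick_delay_element(library, target_ns):
    """Return (size, delay) of the shortest chain with delay >= target_ns."""
    candidates = [(delay, size) for size, delay in library.items() if delay >= target_ns]
    if not candidates:
        raise ValueError("library has no element long enough for the target")
    delay, size = min(candidates)
    return size, delay


# Hypothetical technology: 0.025 ns per inverter, chains of up to 100 inverters.
library = build_delay_library(inverter_delay_ns=0.025, max_pairs=50)

# Start about 25% above the desired clock period (1.244 ns in the case study).
size, delay = pick_delay_element(library, target_ns=1.25 * 1.244)
print(size, "inverters,", round(delay, 3), "ns")
```

Using an even number of inverters keeps the delayed signal non-inverted, so elements of different lengths can be swapped during tuning without changing the logic of the loop.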

5.6 Case Study: PVT-aware DLX Microprocessor

The DLX processor introduced in Section 5.3 was used as a case study. It was implemented

both as a PVT-aware system and a conventional synchronous circuit. The PVT-aware design

flow of Figure 5.3 was implemented in 90nm technology using the toolset shown in Table 5.3. A

low-power design flow similar to that of Figure 5.3 with the same tools and optimizations but

without the PVT-aware implementation steps was realized for the conventional synchronous

design.

Table 5.3: Toolset

Objective                   Tool              Version
Synthesis                   Design Compiler   Y-2006.06-SP5
Timing and power analysis   PrimeTime-PX      Y-2006.06-SP3-1
Physical design             SoC Encounter     5.2
Simulation                  ModelSim          6.3c

The shortest possible post-layout clock period of the DLX core was found to be 1.244 ns

under the worst PVT corner. Hence, the design flow of Figure 5.3 was used to implement

a PVT-aware DLX processor with the same clock period of 1.244 ns but under typical PVT

conditions. Also, the chip was divided into four regions similar to Figure 5.1, and a completion

detection circuit was placed in each quadrant.

5.6.1 Tuning delays

To simplify delay tuning, a library of delay elements in the target technology was implemented

as explained in Section 5.5.1. Delays in the clock generation circuit were chosen to be 25%

larger than needed as a starting margin value. The delays were tuned after the place and route


using an engineering change order (ECO) flow, which trimmed the delays gradually until a

desired final margin (10% in this case study) was reached.

To tune the delays after the place and route, each clock generation loop in Figure 5.2 was

analyzed by PrimeTime to find the resulting clock period. If the period was longer than needed,

the delay element was replaced by a smaller delay element from the delay library, and vice versa.

This process was repeated for each completion detection circuit. After tuning the delays, the

new netlist was fed back to Encounter to update the layout (ECO).
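The tuning procedure described above can be sketched as a simple search over the delay library. The code below is illustrative only: `measure_period_ns` is a hypothetical stand-in for a PrimeTime run on one clock generation loop, and the linear delay model in `stub_sta` is an assumption made so that the example is self-contained.

```python
def tune_delay(sizes, measure_period_ns, critical_path_ns, margin=0.10, start_index=None):
    """Pick the shortest delay element whose loop period covers the target.

    sizes: available delay element sizes, sorted from shortest to longest.
    measure_period_ns: callable standing in for an STA run; it returns the
        loop period obtained with a given element size (hypothetical interface).
    """
    target = critical_path_ns * (1.0 + margin)
    index = len(sizes) - 1 if start_index is None else start_index
    # Trim: move to a shorter element while it still meets the target.
    while index > 0 and measure_period_ns(sizes[index - 1]) >= target:
        index -= 1
    # Grow: if the current element is too short, move to longer ones.
    while index < len(sizes) - 1 and measure_period_ns(sizes[index]) < target:
        index += 1
    return sizes[index]


def stub_sta(size):
    """Assumed loop period: 0.2 ns of fixed overhead plus 0.025 ns per inverter."""
    return 0.2 + 0.025 * size


sizes = list(range(2, 102, 2))  # chains of 2, 4, ..., 100 inverters
chosen = tune_delay(sizes, stub_sta, critical_path_ns=1.244, margin=0.10)
print("chosen element:", chosen, "inverters")
```

In the actual flow each step corresponds to an ECO pass, so the number of iterations matters; starting from a deliberately long element and trimming keeps every intermediate netlist on the safe side of the critical path.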

As explained earlier, a 10% margin was used for the clock period during delay tuning.

Because of different rise and fall delays in the clock generation circuit, successive clock periods

alternated between 1.367 ns and 1.569 ns, resulting in an effective clock period of 1.468 ns. The

clock period and the critical path of the DLX core change with PVT as shown in Table 5.4.

The chip layout of the PVT-aware processor is shown in Appendix C.

Table 5.4: Comparison of the clock period and the critical path

PVT       Successive clock periods   Effective period   Critical path
Best      0.929 ns, 1.032 ns         0.9805 ns          0.828 ns
Typical   1.367 ns, 1.569 ns         1.468 ns           1.244 ns
Worst     2.312 ns, 2.666 ns         2.489 ns           2.168 ns
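The effective periods in Table 5.4 are consistent with the arithmetic mean of the two alternating clock periods. The short Python check below reproduces them from the table values; the helper names are hypothetical, and the margin of the shorter period over the critical path is computed here only as an illustrative metric, not as a figure reported in the thesis.

```python
# Successive clock periods and critical paths from Table 5.4 (ns).
TABLE_5_4 = {
    "Best":    {"periods": (0.929, 1.032), "critical_path": 0.828},
    "Typical": {"periods": (1.367, 1.569), "critical_path": 1.244},
    "Worst":   {"periods": (2.312, 2.666), "critical_path": 2.168},
}


def effective_period(periods):
    """Mean of the alternating clock periods."""
    return sum(periods) / len(periods)


def margin_over_critical_path(period, critical_path):
    """Fractional slack of a clock period relative to the critical path."""
    return period / critical_path - 1.0


for corner, row in TABLE_5_4.items():
    eff = effective_period(row["periods"])
    slack = margin_over_critical_path(min(row["periods"]), row["critical_path"])
    print(f"{corner:8s} effective = {eff:.4f} ns, shortest-period margin = {slack:.1%}")
```

Under the typical corner the shorter of the two periods sits roughly 10% above the critical path, which matches the margin targeted during delay tuning.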

5.6.2 Implementing the fixed-clock counterpart

After finding the effective clock period of the PVT-aware DLX processor under typical PVT

conditions (1.468 ns), a conventional synchronous counterpart (fixed-clock) was implemented

using the same 10% clock period margin. To do so, the DLX core was constrained to a clock

period of 1.34 ns, which was to be met under all three PVT corners. All the optimizations of

Figure 5.3 were applied to the fixed-clock design.

It should be noted that the same 10% margin was used for both the PVT-aware processor

and the fixed-clock version. The actual required margin depends on many parameters such as the fabrication quality, the size of the design, and the noise level (e.g. supply noise and clock jitter).

The important difference between the PVT-aware and the fixed-clock implementations is


that the required clock frequency has to be met even under the worst PVT corner for the

latter. The PVT-aware version meets the required clock frequency and performance under

typical conditions. As shown below, the PVT-aware system achieves a significant reduction in

power at the expense of a lower performance under conditions worse than typical.

5.7 Evaluation

In this section, the PVT-aware and the fixed-clock microprocessors are compared in terms of

power consumption and performance. Using several application programs with different power

consumptions, it is demonstrated that the leakage of the PVT-aware processor is 10X less under

typical conditions and 7X less under worst-case conditions. Other properties of the PVT-aware

microprocessor such as resilience to PVT variations and suitability for voltage scaling are also

examined.

The functionality and power consumption of the PVT-aware DLX processor and its fixed-

clock counterpart were analyzed using the three benchmark suites given in Table 5.5. The

benchmarks were compiled by DLX GCC [82]. Post-layout simulations of the circuits were per-

formed for each benchmark to record switching activities in the switching activity interchange

format (SAIF). These, together with parasitic data (SPEF files), were used by PrimeTime-PX

for simulation-based power analysis.

Table 5.5: Benchmarks

Source                      Benchmarks
MiBench [83]                adpcm coder, adpcm decoder, crc32, dijkstra, qsort
PowerStone [84]             bcnt, blit, compress, ucbqsort
Applications from [85,86]   Bubble Sort, JPEG-DCT, MP3-DCT32, MPEG2-Bdist


5.7.1 Power and performance analysis

Typical PVT: Average power consumption and execution times under typical PVT conditions

are presented in Table 5.6. Power values in the table do not include memory and IO. The core

power is the power of the processor excluding the clock tree.

Table 5.6: Power and performance results for fixed-clock and PVT-aware DLX processors under typical PVT conditions, temp = 25 °C

Fixed-clock Processor

                Average dynamic power (mW)     Average leakage power (mW)   Total average   Execution
Program         Core     Clock tree   Total    Core    Clock tree   Total   power (mW)      time (µs)
adpcm coder     17.012   18.794       35.806   1.237   0.006        1.243   37.049          286.135
adpcm decoder   18.713   18.794       37.507   1.236   0.006        1.242   38.749          1274.348
crc32           17.113   18.794       35.907   1.236   0.006        1.242   37.149          286.851
dijkstra        16.812   18.794       35.606   1.236   0.006        1.242   36.848          143.806
qsort           19.713   18.794       38.507   1.236   0.006        1.242   39.749          729.099
bcnt            17.812   18.794       36.606   1.236   0.006        1.242   37.848          44.204
blit            17.511   18.794       36.305   1.237   0.006        1.243   37.548          103.950
compress        16.513   18.794       35.307   1.236   0.006        1.242   36.549          1508.450
ucbqsort        19.414   18.794       38.208   1.235   0.006        1.241   39.449          1399.062
Bubble Sort     24.012   18.794       42.806   1.235   0.006        1.241   44.047          12.963
JPEG-DCT        19.813   18.794       38.607   1.236   0.006        1.242   39.849          588.808
MP3-DCT32       20.315   18.894       39.209   1.233   0.006        1.239   40.448          73.323
MPEG2-Bdist     18.613   18.794       37.407   1.235   0.006        1.241   38.648          50.578
Average         18.380   18.795       37.175   1.236   0.006        1.242   38.402

PVT-aware Processor

                Average dynamic power (mW)     Average leakage power (mW)   Total average   Execution
Program         Core     Clock tree   Total    Core    Clock tree   Total   power (mW)      time (µs)
adpcm coder     10.15    18.996       29.146   0.125   0.004        0.129   29.275          286.136
adpcm decoder   11.25    18.996       30.246   0.125   0.004        0.129   30.375          1274.349
crc32           10.05    18.996       29.046   0.125   0.004        0.129   29.175          286.852
dijkstra        9.75     18.996       28.746   0.125   0.004        0.129   28.875          143.807
qsort           12.45    18.896       31.346   0.125   0.004        0.129   31.475          729.100
bcnt            10.85    18.896       29.746   0.125   0.004        0.129   29.875          44.205
blit            10.35    18.996       29.346   0.125   0.004        0.129   29.475          103.951
compress        9.65     18.996       28.646   0.125   0.004        0.129   28.775          1508.451
ucbqsort        12.15    18.896       31.046   0.125   0.004        0.129   31.175          1399.063
Bubble Sort     15.148   18.796       33.944   0.125   0.004        0.129   34.073          12.964
JPEG-DCT        11.95    18.996       30.946   0.125   0.004        0.129   31.075          588.809
MP3-DCT32       11.85    18.996       30.846   0.125   0.004        0.129   30.975          73.324
MPEG2-Bdist     11.15    18.896       30.046   0.125   0.004        0.129   30.175          50.579
Average         11.133   18.961       30.094   0.125   0.004        0.129   30.223


The PVT-aware processor executes all benchmark programs in the same execution time as the fixed-clock processor under typical conditions. However, it consumes less leakage and dynamic power. To calculate the average power values (the Average rows of Table 5.6), the energy consumption of each program is calculated, the energies are summed, and the sum is divided by the total execution time. On average, the leakage power of the PVT-aware processor is 10X less than

that of its fixed-clock counterpart and its dynamic power is 19% less for the same performance.

The total power of the PVT-aware processor is 21% smaller, on average. Since the clock

tree power is almost equal for the two designs, only the core power contributes to the power

differences.
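The averaging rule and the quoted savings can be illustrated with a few lines of Python. This is a sketch with hypothetical helper names: the first part applies the energy-based averaging to made-up numbers to show how it differs from a plain mean, and the second part recomputes the ratios from the Average rows of Table 5.6.

```python
def time_weighted_average_power(powers_mw, times_us):
    """Total energy divided by total execution time.

    mW x µs gives nJ, so the quotient is again in mW.
    """
    if len(powers_mw) != len(times_us) or not times_us:
        raise ValueError("need one execution time per power value")
    energy_nj = sum(p * t for p, t in zip(powers_mw, times_us))
    return energy_nj / sum(times_us)


def relative_saving(reference, improved):
    """Fraction by which `improved` is lower than `reference`."""
    return 1.0 - improved / reference


# Toy example (made-up numbers): 10 mW for 1 µs, then 20 mW for 3 µs.
toy_average = time_weighted_average_power([10.0, 20.0], [1.0, 3.0])
print(toy_average)  # 17.5, whereas the plain mean of the two powers is 15.0

# "Average" rows of Table 5.6 (mW).
FIXED = {"dynamic": 37.175, "leakage": 1.242, "total": 38.402}
PVT_AWARE = {"dynamic": 30.094, "leakage": 0.129, "total": 30.223}

leakage_ratio = FIXED["leakage"] / PVT_AWARE["leakage"]
dynamic_saving = relative_saving(FIXED["dynamic"], PVT_AWARE["dynamic"])
total_saving = relative_saving(FIXED["total"], PVT_AWARE["total"])
print(f"leakage ratio {leakage_ratio:.1f}X, "
      f"dynamic saving {dynamic_saving:.0%}, total saving {total_saving:.0%}")
```

Weighting by execution time means that long-running programs such as compress and ucbqsort dominate the averages, while short ones such as Bubble Sort contribute little despite their higher power.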

Table 5.7 shows how the two designs differ in some of the key parameters that affect power

consumption. The PVT-aware design is implemented mostly with HVT cells. The fixed-clock system uses a large number of LVT and SVT cells, which have significantly larger leakage power. Also, the fixed-clock system is a bigger circuit with more switching capacitance: its standard-cell area is about 17% larger (equivalently, the PVT-aware design is about 14% smaller). Therefore, the fixed-clock design consumes more dynamic power than its PVT-aware counterpart.

Table 5.7: Post-layout area and leakage breakdown under typical PVT corner, temp = 25 °C

              Leakage power (µW) and number of cells                          Total area of
Design        HVT             SVT            LVT            Total             std. cells (µm2)
PVT-aware     14.45           27.12          87.18          128.75            129637.86
              (10680 cells)   (1901 cells)   (662 cells)    (13243 cells)
Fixed-clock   5.45            85.96          1150.00        1241.41           151573.39
              (3821 cells)    (4715 cells)   (5652 cells)   (14188 cells)

Worst-case PVT: The power and performance values presented above were obtained under the typical PVT corner provided for the technology, which uses a temperature of 25 °C. Power requirements, especially leakage power, were next examined under the worst-case PVT corner at a temperature of 125 °C (this was the only available PVT corner with a high temperature).

Average power consumption of all benchmarks and their total execution time for this case

are presented in Table 5.8. Leakage power is a substantial portion of the total power for the

two designs under these conditions. The PVT-aware processor has a 7X lower leakage power.

The speed of the PVT-aware processor is reduced as a result of being exposed to worse-than-


typical conditions, and thus, its dynamic power is also reduced. The power-delay product of

the PVT-aware system is 2.30X smaller than that of its fixed-clock counterpart.

Table 5.8: Power and performance results under worst-case PVT, temp = 125 °C

Design        Av. leakage   Av. total    Total ex.   Power-delay
              power (mW)    power (mW)   time (ms)   product (µJ)
Fixed-clock   59.193        87.216       6.502       567.08
PVT-aware     8.092         22.344       11.023      246.30
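The power-delay products in Table 5.8 follow directly from the average total power and the total execution time, since mW multiplied by ms gives µJ. The Python check below, with hypothetical names, recomputes them together with the leakage and power-delay ratios quoted in the text.

```python
# Values from Table 5.8 (worst-case PVT corner, 125 °C).
TABLE_5_8 = {
    "Fixed-clock": {"leakage_mw": 59.193, "total_mw": 87.216, "time_ms": 6.502},
    "PVT-aware":   {"leakage_mw": 8.092,  "total_mw": 22.344, "time_ms": 11.023},
}


def power_delay_product_uj(total_mw, time_ms):
    """Average total power times total execution time (mW x ms = µJ)."""
    return total_mw * time_ms


pdp = {
    name: power_delay_product_uj(row["total_mw"], row["time_ms"])
    for name, row in TABLE_5_8.items()
}
pdp_ratio = pdp["Fixed-clock"] / pdp["PVT-aware"]
leakage_ratio = TABLE_5_8["Fixed-clock"]["leakage_mw"] / TABLE_5_8["PVT-aware"]["leakage_mw"]
slowdown = TABLE_5_8["PVT-aware"]["time_ms"] / TABLE_5_8["Fixed-clock"]["time_ms"]

for name, value in pdp.items():
    print(f"{name}: {value:.2f} µJ")
print(f"PDP ratio {pdp_ratio:.2f}X, leakage ratio {leakage_ratio:.1f}X, slowdown {slowdown:.2f}X")
```

The slowdown of the PVT-aware processor (about 1.7X) is in line with the ratio of its worst-case to typical effective clock periods in Table 5.4.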

5.7.2 Resilience to inter-chip PVT variations

The PVT-aware processor was tested under the three PVT corners. The system executed all

benchmarks correctly and tuned itself to produce the best possible results under the prevailing

PVT conditions. Figure 5.4 shows that the execution time changes as the clock period changes

under different PVT conditions. The results in this figure are consistent with the clock periods

presented previously in Table 5.4.

Figure 5.4: Performance of PVT-aware and fixed-clock DLX processors under all PVT corners. [Bar chart: execution time relative to the worst case of the PVT-aware processor (axis from 0 to 1) for the Best, Typical and Worst corners; series: Fixed-clock, PVT-aware.]

5.7.3 Resilience to intra-chip PVT variations

The PVT-aware processor was also tested for its resilience to intra-chip PVT variations. An

area on the right side of the chip equal to about 1/3 of the total area was selected. Starting from the typical-case SDF file, the delays of all the cells in that area were augmented by 10% using Design Compiler’s derating commands. A new SDF file was generated to be used in

simulations, which verified that the system executed all the benchmarks correctly. Table 5.9

shows the change in clock periods with the increased delay.

Table 5.9: Clock period changes with intra-chip variations

PVT                         Successive clock periods   Critical path
Chip under typ.             1.367 ns, 1.569 ns         1.244 ns
Chip with augmented delay   1.473 ns, 1.600 ns         1.378 ns

5.7.4 Suitability for voltage scaling

The 90nm technology used in the experiments is characterized for two supply voltage levels:

1.0 V and 1.2 V. These characterizations were used to apply voltage scaling to the PVT-aware

DLX processor. It was ensured that the pads were compatible with 1.2 V and no hold violation

occurred. Then, the PVT-aware processor was tested under typical PVT conditions for both

supply voltages. The system automatically adjusted its frequency to changes in supply voltage.

The clock frequency of the system with 1.2 V was 1.17 times higher than that for a voltage supply

of 1.0 V. This shows that the PVT-aware design is amenable to voltage scaling techniques.

5.8 Discussion

5.8.1 Design space

The proposed PVT-aware approach expands the design space by providing more flexibility to

power and performance trade-offs. This can be useful in the implementation of many applica-

tions, such as portable systems. The case study presented in this chapter shows that a system

can be designed to deliver the desired performance under typical conditions, which are the con-

ditions that the system is exposed to most of the time. If the conditions get worse, the system

is still functional as it automatically slows down and if the conditions get better, it will speed

up. The advantage of such a system over the fixed-clock design is an overall reduction in power and area, which is the result of using a smaller number of high-leakage high-speed cells to reach the desired performance.

Figure 5.5: Design space expansion using PVT-aware design. [Plot labels: Conventional Design, PVT-aware Design, Low frequencies, High frequencies.]

Figure 5.5 shows the design space, which was explained for the conventional design in

Chapter 4. Using the conventional design approach, the designer of a system may trade off

performance for power. A high target clock frequency results in a high dynamic

power dissipation. To reach higher frequencies, a larger number of high-speed high-leakage LVT

cells are required, which increase the leakage power consumption. Similarly, the required area

is increased as the target clock frequency increases.

The proposed PVT-aware design approach adds a new curve to the design space, which

may be used to achieve the required speed with lower area and power consumption, compared

to the conventional design. The distance between the two curves in Figure 5.5 increases with

the clock frequency. At lower frequencies, conventional and PVT-aware designs both achieve

the required speed using a small number of LVT cells. However, as the target clock frequency

increases, the number of LVT cells, and thus the total leakage power of the conventional design

become more substantial than those of the PVT-aware design. Similarly, the difference between

the area of the conventional design and that of the PVT-aware design becomes larger as the


target clock frequency increases and so does the difference between their dynamic power values.

Alternatively, a PVT-aware system may be designed to deliver the performance of its fixed-

clock counterpart under worst-case conditions. This is useful when there is a hard limit on

the required clock frequency. Such a system delivers a performance better than that of its

fixed-clock counterpart under typical conditions, at the expense of higher power consumption

as the clock frequency increases. It should be noted that the clock period of the proposed

clock generation circuit is limited by the loop delays and therefore, it may not reach the clock

frequency of a fixed-clock synchronous design under worst-case conditions.

Suitability of PVT-aware systems for voltage scaling adds another degree of freedom to the

trade-offs available to the designer. PVT-aware systems automatically adjust their frequency

to the input voltage. Hence, the input voltage can be reduced using dynamic voltage scaling

techniques to conserve power when top performance is not required. On the other hand, the

input voltage may be increased to boost performance when the system is exposed to poor PVT

conditions.

5.8.2 Clock error detection

Error detection for the clock generation circuit is addressed here. The clock is started by the

reset signal. Since each clock pulse depends on the previous one and the clock generator is not

self-starting, an error detection circuit should exist to detect that the clock has stopped and to

reset the clock generation circuit.

The C-element in the clock generation circuit can mask transient pulses or glitches of up to

a certain width. However, wider glitches caused by noise at the inputs of the C-element may

mistakenly be interpreted as a completion signal. Also, transient pulses at the delay lines may

cause a completion detection to be generated at a wrong time. These may cause the clock pulse

generator to freeze, which should be detected.

In an experiment, the clock generated by the on-chip clock generation circuit was checked

using a simple test circuit and a reference clock signal. A counter clocked by the generated clock signal was sampled periodically in the reference clock domain. If the counter stopped changing, an error signal was generated. The reference clock frequency and the interval between samplings


should be selected appropriately with respect to the frequency range of the generated clock.
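The stopped-clock check described above can be modelled in software as follows. This is a behavioural sketch rather than the test circuit used in the experiment: the function name and the `stall_limit` parameter are hypothetical, and the samples are assumed to have been captured cleanly in the reference clock domain.

```python
def clock_stopped(samples, stall_limit=1):
    """Return True if the sampled counter stays unchanged for
    `stall_limit` consecutive sampling intervals.

    samples: counter values captured in the reference clock domain,
        oldest first.
    """
    unchanged = 0
    for previous, current in zip(samples, samples[1:]):
        # A counter driven by a running clock must differ between samples.
        unchanged = unchanged + 1 if current == previous else 0
        if unchanged >= stall_limit:
            return True
    return False


# Counter advancing normally, then frozen after the third sample.
print(clock_stopped([3, 7, 11, 11]))                      # True
print(clock_stopped([3, 7, 11, 15]))                      # False
print(clock_stopped([3, 7, 11, 11, 12], stall_limit=2))   # False
```

The sampling interval matters in both directions: if it is shorter than the slowest expected clock period, a healthy clock is reported as stopped, and if it equals a whole number of counter wrap-arounds, a running counter can appear unchanged.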

5.8.3 Expanding the PVT-aware approach

In the case study presented above, the typical PVT corner at a temperature of 25◦C provided

by the technology supplier was used. In the conventional design, important PVT corners for

performance analysis are the worst-case and best-case corners, not the typical one. Therefore, IC foundries usually do not provide enough typical PVT corners. However, information about more typical PVT corners at higher temperatures is required for the PVT-aware design approach, because a design under typical conditions is likely to run at a temperature higher than 25 °C.

It should be noted that the case study presented in this chapter used four PVT regions

for a microprocessor. The actual required number of regions depends on many parameters

including the size of the design, voltage and temperature profiles over the chip, the quality of

the fabrication process and the margins used for the delays.

Different paths might have different sensitivity to PVT changes and thus, under different

PVT conditions, the critical path changes. As a result, the circuit may have multiple critical

paths. To ensure that the delay loops are always longer than the critical path, the proposed

design flow measures the delay of all paths under various PVT conditions, independent of which

path is the critical path.

It should be possible to apply the PVT-aware clocking approach in a modular way to

complex and large systems on chip (SoCs). SoCs are growing in complexity and hence, multiple

clock domains are inevitable. Complex SoCs are composed of several modules with different

clock domains, which communicate using asynchronous interfaces [87]. In the case of the PVT-

aware design, clock generation circuits can be used in different modules and similar approaches

can be used to connect the modules from different clock domains. Therefore, a large SoC should

benefit from the advantages of the proposed PVT-aware design.


5.9 Comparison to Previous Work

The idea of exploiting typical PVT conditions to improve performance has been used in different

designs including clock frequency control systems such as Dean’s STRiP processor [7] and

TEAtime [73] and also asynchronous circuits [3, 46]. However, the approach presented in this

chapter employs PVT-aware properties primarily to reduce leakage power consumption. There

are also several other differences as discussed briefly in this section.

A predecessor to the clocking scheme in this thesis was presented in Dean’s PhD thesis in the implementation of a self-timed RISC processor. Dean proposed the clock generation structure shown in Figure 5.6. A C-element and a pulse generator at the center are used to generate the clock pulse. Tracking cells are designed to match the delay of different functional units. To do so, several functions are selected according to the frequency of their use and partially replicated at the transistor level. When all the tracking cells signal the completion of their corresponding operations, a new clock pulse is generated. Dean demonstrated a two-fold speed improvement under typical PVT conditions using his approach. Dean’s work opened up the area of variable or dynamic clocking.

Figure 5.6: Dean’s clocking structure [7]

The clock generation method presented in this thesis builds on Dean’s work and significantly

improves it. The design of tracking cells in Dean’s work is complicated because they should

accurately replicate the critical path of the corresponding function. Also, they are implemented


at the transistor level. Loads on the transistors of the functional units are imitated using

passive transistors. Replicas of circuit paths may be large and power consuming compared

to the matched delays used in this thesis. However, they more accurately model the delay of

corresponding paths. The approach presented in this thesis is easily incorporated in standard-

cell ASIC design implementation and does not require customized transistor-level design. Also,

it employs static timing analysis in the design of the matching delay elements of completion

detection circuits. As such, the design of the completion detection circuit is independent of the

function being matched. This allows the introduction of a simple design methodology that uses

conventional design tools to implement variable-clock systems with standard cells. As a result,

the proposed approach can be readily used in many applications.

A pioneering work in reducing power consumption by reducing worst-case margins is the

Razor project [8]. The Razor project shows the possibility of reducing the voltage margins

used in worst-case analysis of synchronous circuits. This work reduces power consumption by

reducing the input voltage, but keeps the clock frequency intact. The design does not guarantee

error-free operation. Hence, an error recovery circuit is added to cope with any timing errors

that may result from the reduced voltage.

As a result of operating at a reduced voltage, errors may occur in operations that require

full voltage. Such errors are detected using the Razor flip-flops shown in Figure 5.7. A shadow latch operating with a delayed clock is used to obtain the correct result, which is compared to

the data in the main flip-flop. If there is a difference, an error signal is generated. The error

signal causes the data in the main flip-flop to be replaced with the data in the latch. Then,

the pipeline is flushed in a counter-flow approach. Simulation results show a 64% power saving

with less than 3% performance penalty in a simplified 64-bit Alpha pipeline design in 180-nm

technology.

Figure 5.7: Razor flip-flop for a pipeline stage [8]. (a) A shadow latch controlled by a delayed clock augments each flip-flop. (b) Razor flip-flop operation with a timing error in cycle 2 and recovery in cycle 4. [Schematic labels: logic stages L1 and L2, main flip-flop, shadow latch, comparator, Error, Error_L, clk, clk_delayed, D1, Q1, Razor FF. Timing diagram labels: clk, clk_delayed, D, Error, Q, Cycles 1 to 4, Instr 1, Instr 2.]

Compared to the PVT-aware approach presented in this thesis, Razor has the advantage of keeping the clock frequency fixed, which might be useful for fixed data rate applications. However, implementing Razor requires both architectural and circuit changes. Implementing Razor flip-flops and the pipeline flushing mechanism increases the area and complexity of the system. Flushing a pipeline with many stages may result in a significant performance loss. In addition, the Razor-based design methodology inherits the disadvantage of traditional synchronous design, in which the circuit is optimized for worst-case conditions. Although such over-design for typical conditions can be partially overcome by lowering the supply voltage, the area overhead is still incurred, which in turn leads to increased leakage power. In contrast, the proposed PVT-aware mechanism targets typical-case conditions during the synthesis and physical design phases of the implementation. This leads to smaller area and leakage. Meanwhile, the clock frequency requirement is guaranteed not only for typical conditions, but also for worst-case conditions by raising the supply voltage.

TEAtime [73] adjusts the clock frequency as PVT conditions change. The critical path

of the system is replicated and tested with the clock frequency continuously changing using a

voltage-controlled oscillator (VCO) to find the highest suitable value for the system. A clock

frequency increase of 34% under typical PVT operating conditions compared to worst-case

conditions has been demonstrated in an FPGA implementation of a DLX-style processor. The

design methodology described in this chapter achieves better performance improvements from

worst-case to typical PVT conditions compared to TEAtime. It also adapts to intra-chip PVT

variations.

Asynchronous circuits can also be designed to adjust their speed to PVT conditions. Several

asynchronous design styles exist including desynchronization [46, 53], in which a synchronous

design is converted into an asynchronous one, and Mousetrap [3], which is a methodology to

design high-speed pipelines taking advantage of PVT variability. The methodology proposed

in this chapter is simpler than asynchronous design styles because conventional synchronous

design tools are used and thus, neither asynchronous design methods nor asynchronous design

tools are required. The desynchronization method introduces an area overhead of 13.5% in

a DLX microprocessor [46]. By comparison, area overhead using the presented PVT-aware

architecture is only 0.5% for a similar DLX processor.


5.10 Conclusion

This chapter proposed the use of PVT-aware design to implement systems that are capable

of delivering the same performance as conventional synchronous circuits under typical PVT

conditions, but with much reduced power requirements. This chapter presented a complete

design solution for PVT-aware systems, including a PVT-aware architecture and a design flow

to implement such systems using standard-cell ASICs. The suggested methodology expands the

digital design space for many applications with low power and high performance requirements,

such as portable devices.

The case study of the DLX microprocessor has demonstrated that the PVT-aware system,

implemented in 90nm technology, delivers the same performance as its fixed-clock counterpart

under typical PVT conditions, with 10X less leakage and 19% less dynamic power. The clock

frequency changes automatically to produce the best-possible results under the prevailing PVT

variations. It has also been shown that voltage scaling techniques can be applied to PVT-aware

systems, which automatically adjust their speed to the input voltage.

Chapter 6

VariPipe: Variable-clock

Synchronous Pipelines

6.1 Introduction

In many pipelined systems, such as microprocessors, the time required to complete an op-

eration in any given stage of the pipeline depends on the operation being performed. In a

conventional synchronous system, the delay of the longest path of the pipeline under the worst

process-voltage-temperature (PVT) corner is used to determine the clock frequency. However,

the longest path of the system is not necessarily triggered in every cycle. Also, the system

is normally operating under typical PVT conditions. Hence, there are many times when a

frequency much higher than that derived under worst conditions is possible.

This chapter introduces a variable-clock synchronous pipeline design (VariPipe), in which

the clock period is adjusted in each clock cycle based on the operations taking place in the

pipeline stages [88]. An on-chip clock generation circuit dynamically matches the delay of

the current operations of the pipeline in every cycle. At the same time, the clock period au-

tomatically adjusts to the current PVT conditions. The proposed approach achieves better

performance than isochronous clocking while retaining the simplicity of synchronous system

design. Other advantages of variable-clock synchronous pipelines include a reduction in elec-

tromagnetic noise and suitability for voltage scaling techniques. These features make variable

clocking appealing for many applications including embedded systems and portable devices.

Several studies have been published that address variable-speed pipelines, including Tele-

scopic units [89] and a variable-clock pipeline processor introduced by Dean [7]. Asynchronous

design methodologies such as desynchronization [46] and Mousetrap [3] have also been proposed

to achieve average-case performance. The main advantage of VariPipe over previous work is

that the overhead incurred by the added clock generation circuit is low, thanks to the use of

variable delay elements. According to the case study presented in this chapter, the overhead of

the added clock generation circuit for a VariPipe DLX processor is only 2.6% in area and 3% in

energy consumption. Dean’s approach achieves the same performance improvement of 2X over

conventional isochronous design as the VariPipe processor. However, it duplicates functional

units, which doubles their area and energy consumption. The desynchronization

method introduces a 13.5% area overhead in a DLX processor [46]; the processor does not adjust

its speed based on the operations in the pipeline and therefore its performance gain is limited.

The VariPipe methodology is based on a synchronous circuit implementation, and thus, many

challenges in the design of asynchronous circuits are avoided. Also, it employs a simple design

methodology using standard cells and conventional synchronous design tools, which allows

designers to use the proposed approach in many applications. A comprehensive comparison to

related work is presented in Section 6.8.

Section 6.2 describes the basic idea of the VariPipe approach and Section 6.3 explains the

methodology in more detail. Timing constraints are given in Section 6.4. A design flow to

implement VariPipe application specific integrated circuits (ASICs) is presented in Section 6.5,

and the proposed approach is demonstrated and evaluated through a microprocessor case study

in Sections 6.6 and 6.7.

6.2 VariPipe: The Idea

Consider a pipelined system consisting of several pipeline stages of combinational logic, sepa-

rated by pipeline registers. Each pipeline stage may have different modes of operation exercised

by different instructions as they flow through the pipeline stage. For example, the execution


stage of a RISC processor may execute different operations such as addition, bitwise logical

operations, etc. The key observation is that these operations activate different paths and thus,

they have different delays. In the isochronous clocking scheme, employed in today’s dominant

Electronic Design Automation (EDA) methodology, the clock period is constant, which means

it must be longer than the delay of all possible operations in the pipeline at all times. VariPipe

employs a clocking scheme in which the clock period continuously tracks the maximum delay

under current PVT conditions for all operations currently being performed. As Figure 6.1

shows, a variable delay is associated with each pipeline stage, and its delay is adjusted to

match the delay of the current operation in the stage, as determined by the data in that stage’s

Figure 6.1: VariPipe technique

input registers. When the delays of all pipeline stages have elapsed, the clock pulse generator

creates a new clock pulse. As a result, some clock cycles are shortened and the overall speed is

increased. The variable delay unit is placed close to the corresponding datapath to be subject

to the same PVT conditions.
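The per-cycle clocking rule described above can be illustrated with a small Python model. This is only a sketch: the stage names, operation names and delay values are hypothetical, and the actual mechanism is an on-chip circuit rather than software.

```python
# Illustrative model of per-cycle clocking; stage names, operation names and
# delay values (ns) are hypothetical, not taken from the case study.
STAGE_DELAYS = {
    "decode": {"branch": 9.0, "other": 7.0},
    "execute": {"sub": 9.5, "shift": 7.5, "nop": 7.0},
}


def cycle_period(current_ops):
    """Period of one cycle: the slowest operation currently in any stage."""
    return max(STAGE_DELAYS[stage][op] for stage, op in current_ops.items())


def fixed_period():
    """Isochronous period: the slowest operation of any stage, used every cycle."""
    return max(max(ops.values()) for ops in STAGE_DELAYS.values())


# One entry per clock cycle: which operation each stage is performing.
trace = [
    {"decode": "other", "execute": "nop"},
    {"decode": "other", "execute": "shift"},
    {"decode": "branch", "execute": "sub"},
]

variable_total = sum(cycle_period(ops) for ops in trace)
fixed_total = fixed_period() * len(trace)
print(f"variable clock: {variable_total} ns, fixed clock: {fixed_total} ns")
```

In this toy trace only the last cycle exercises the critical path, so the variable clock finishes the three cycles sooner than the fixed clock.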

Note that although the proposed architecture benefits from its asynchronous nature, the use

of asynchronous design is limited to the clock generation circuit, leaving the rest of the system

still a synchronous circuit that can be designed, synthesized, and laid out using a traditional

design flow.


6.3 Design Methodology

In this section, the methodology for the design and implementation of VariPipe systems is

described in detail. The design process starts with a high-level hardware description of the

system and its implementation in the target technology. Adding the VariPipe facilities involves

three steps: creating delay profiles, simplifying the delay profiles and implementing the clock

generation circuit.

6.3.1 Creating delay profiles

Different operations in any stage of the pipeline can be identified from the high-level hardware

description of the system. Each operation takes the values in the input registers and saves its

result in the output registers. The result of an operation may not be needed in every cycle,

as determined by the selection signals for that operation. The different operations that can be

performed in any pipeline stage and the conditions under which the results of those operations

are selected are recorded in an operation selection table. The case study below shows that

operation selection tables are easy to prepare, because they are constructed using a high-level

description of the system without using the low-level implementation details.

The maximum delay of any operation of the pipeline stage can be determined from the

low-level implementation of the system in the target technology. There are two methods to find

the delays: (i) dynamic timing analysis (DTA), which finds the delays using test vectors, and (ii)

static timing analysis (STA), which is used here. For each pipeline stage, the delays of all the

operations are found, and a delay profile is created by grouping the operations of the pipeline

stage according to their delay values. The case study below presents a simple and automated

approach to construct delay profiles.

6.3.2 Simplifying delay profiles

Each pipeline stage has a minimum delay that can be identified from the delay profile of

that stage. The path having the largest minimum delay of all pipeline stages is the shortest

inevitable path of the pipeline. To reduce the number of delay values needed, delay values less


Figure 6.2: Clock generation circuit

than the delay of the shortest inevitable path in each profile are grouped and rounded up to

the maximum value of the group. This simplifies the delay profile and the implementation of

the clock generation circuit.
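The grouping rule can be sketched in Python as follows. The function name and data layout are assumptions made for illustration; the example delays and the 7.18 ns threshold are the execution-unit values reported in the case study later in this chapter.

```python
def simplify_profile(profile, shortest_inevitable):
    """Group delays below the threshold and round them up to the group's maximum.

    profile maps an operation name to its delay in ns; shortest_inevitable is
    the delay of the shortest inevitable path of the pipeline in ns.
    """
    below = [delay for delay in profile.values() if delay < shortest_inevitable]
    if not below:
        return dict(profile)
    group_max = max(below)
    return {
        op: (group_max if delay < shortest_inevitable else delay)
        for op, delay in profile.items()
    }


# A few execution-unit delays from the case study (ns, including the margin).
execute = {"SUBI": 9.60, "SLT": 7.60, "SRLI": 6.91, "NOP": 2.27}
simplified = simplify_profile(execute, shortest_inevitable=7.18)
print(simplified)
```

Operations faster than the shortest inevitable path cannot shorten the clock period, so collapsing them into one delay value loses no performance while reducing the number of taps the variable delay must provide.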

6.3.3 Implementing the clock generation circuit

Figure 6.2 shows the clock generation circuit, which is composed of two parts: the completion

detection circuits and the clock pulse generator. The design of the clock generation circuit

is based on the two-phase single-rail asynchronous design style [2, 12] and thus inherits many

properties from asynchronous systems. The completion detection circuit for each stage is com-

posed of a variable delay, a toggle, an operation selection table and a delay selector. The delay

selector reads appropriate signals from the inputs to the pipeline stage. It uses the operation

selection table, which is ordered according to the delay values, to generate a one-hot delay

selection signal (S) to select the appropriate delay value. If the inputs to the pipeline stage

activate more than one operation (e.g., in a complex multi-task stage), the delay corresponding

to the operation with the longest delay is selected.
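A behavioural sketch of the delay selector is given below. It is not the hardware implementation: the table layout, field names and delay values are hypothetical, while the opcode and function encodings follow the ADD/SUB/AND example used in the case study. Ordering the table from the longest delay to the shortest means that the first matching entry is also the longest active one.

```python
# Hypothetical operation selection table for an execution stage, ordered from
# the longest delay to the shortest. The delay values (ns) are made up.
TABLE = [
    ("SUB", lambda f: f["opcode"] == 0 and f["function"] == 0b100010, 9.0),
    ("ADD", lambda f: f["opcode"] == 0 and f["function"] == 0b100000, 8.5),
    ("AND", lambda f: f["opcode"] == 0 and f["function"] == 0b100100, 7.0),
]


def select_delay(fields):
    """Return (one-hot selection vector S, selected delay) for the stage inputs."""
    for index, (_name, is_active, delay) in enumerate(TABLE):
        if is_active(fields):
            one_hot = [int(i == index) for i in range(len(TABLE))]
            return one_hot, delay
    raise ValueError("no operation in the table matches these inputs")


selection, delay = select_delay({"opcode": 0, "function": 0b100000})
print(selection, delay)
```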

When the clock pulse emerges from the variable delay element, it is converted to a level

by the toggle before being sent to the C-element. Initially, all toggle elements are reset and

so is the output of the C-element. After the reset is removed, all toggle elements change state


causing the C-element to toggle its output, thus creating a clock pulse of width CPW at the

output of the XOR. The clock pulse loads new values into the input registers of each pipeline

stage.

Note that the delay through the delay elements must be at least long enough for the corre-

sponding delay-selection signals to become valid. After a delay matching the operation with the

longest delay currently in the stage, the toggle changes state. When all stages have switched

to the new state, the C-element toggles creating a new clock pulse.
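The role of the C-element can be illustrated with a minimal behavioural model in Python. The class below is an assumption made for illustration; it models only the logical behaviour (the output changes when all inputs agree) and omits the XOR and CPW delay that turn each output transition into a clock pulse.

```python
class CElement:
    """Muller C-element: the output follows the inputs only when all agree."""

    def __init__(self):
        self.out = 0  # state after reset

    def update(self, inputs):
        if all(value == inputs[0] for value in inputs):
            self.out = inputs[0]
        return self.out


toggles = [0, 0, 0]  # toggle outputs of three stages after reset
c_element = CElement()
history = []
for stage in (1, 0, 2):  # stages finish in an arbitrary order
    toggles[stage] ^= 1  # the stage's delay has elapsed, so its toggle flips
    history.append(c_element.update(toggles))

print(history)
```

The output changes only after the slowest stage has toggled, which is why the clock period tracks the longest delay currently in the pipeline.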

6.3.4 Variable delay implementation

Consider a pipeline stage whose delay profile has three values, d1, d2, and d3 (d3 > d2 > d1).

The first two delays are selected by signals S1 and S2. Otherwise, d3 is selected. The design

of the variable delay and the output toggle for that stage are illustrated in Figure 6.3. The

values of the three delay elements k1, k2 and k3 are selected such that the total delay around

the clock loop in Figure 6.2 matches the stage’s delay profile d1, d2 and d3.
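One way to derive the element values is sketched below, assuming the three elements are cascaded so that S1 selects the output of k1, S2 the output of k1 followed by k2, and the default path passes through all three. Both this tapped-chain reading and the single `loop_overhead_ps` term (lumping the toggle, clock pulse generator and clock tree delays) are assumptions; the numbers are hypothetical and expressed in picoseconds.

```python
def delay_elements(profile_ps, loop_overhead_ps):
    """Split an increasing delay profile (d1 < d2 < ...) into cascaded elements.

    The first element absorbs the rest of the clock loop, so that
    k1 + loop_overhead = d1; each later element adds the difference to the
    next profile value.
    """
    if sorted(profile_ps) != list(profile_ps):
        raise ValueError("profile must be in increasing order")
    elements = [profile_ps[0] - loop_overhead_ps]
    elements += [b - a for a, b in zip(profile_ps, profile_ps[1:])]
    if elements[0] <= 0:
        raise ValueError("loop overhead exceeds the shortest profile delay")
    return elements


k1, k2, k3 = delay_elements([7000, 8000, 9500], loop_overhead_ps=1000)
print(k1, k2, k3)
```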

Figure 6.3: Variable delay and toggle

Delay k1 in Figure 6.3 consists of a long chain of gates which change state twice with

every input pulse. To reduce power consumption, the delay architecture shown in Figure 6.4

is proposed. The input pulse to the delay chain is converted to a level, then at the end of the

chain, converted back to a pulse. As a result, the gates composing delay L1 switch only once

with each input pulse. Delay L1 should be tuned such that the minimum delay of the path

between the input and the output matches the desired delay, k1, and delay element PW should

be adjusted to generate a suitable pulse width. Simulations showed that the power saving

achieved by this technique is close to 50% for long delay chains.
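A rough transition count shows why the saving approaches 50% for long chains. The model below is a simplification that uses the number of gate transitions as a proxy for switching power; the fixed `converter_transitions` overhead of the pulse-to-level and level-to-pulse converters is a hypothetical figure.

```python
def transitions_pulse(n_gates):
    """A pulse has a rising and a falling edge, so every gate switches twice."""
    return 2 * n_gates


def transitions_level(n_gates, converter_transitions=4):
    """A level change is one edge per gate, plus the converters' own activity."""
    return n_gates + converter_transitions


def saving(n_gates, converter_transitions=4):
    """Fraction of transitions saved by propagating a level instead of a pulse."""
    return 1 - transitions_level(n_gates, converter_transitions) / transitions_pulse(n_gates)


for n in (10, 100, 1000):
    print(n, round(saving(n), 3))
```

The converter overhead is constant, so its share shrinks as the chain grows and the saving tends towards one half.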


Figure 6.4: Reducing the switching power of delay element

6.4 Timing Constraints

Figure 6.5 shows a simplified model of the clock generation circuit with three completion de-

tection circuits. The timing constraints on the design of the clock generation circuit may be

summarized as follows:

• The reset signal to the system must be long enough to ensure that the delay elements are

successfully reset and all the gates and flip-flops become stable.

• Each loop in Figure 6.5 has different rise and fall times. The minimum delay of the loop

should be used for delay tuning.

• Each completion detection circuit must be placed within the corresponding stage to ensure

that it matches the datapaths’ delays under the prevailing PVT conditions in that stage.

Figure 6.5: A simplified model of the clock generation circuit


When adjusting the delay elements, appropriate margins should be used because factors

such as crosstalk, IR drops, noise, inductance, etc. may affect the datapath and the

completion detection circuit differently.

• Part of the delay of the loops in Figure 6.5 is the clock pulse generator and the clock tree

delay. The clock pulse generator and the root cells of the clock tree are not necessarily

close to the pipeline stage and their PVT conditions may be different. Therefore, the

clock pulse generator and clock tree delays must be used with appropriate margins when

tuning delays. In the case study presented below, only 90% of the clock tree and the clock

generation circuit delay is taken into account, thus ensuring that the total delay around

each of the clock loops is slightly larger than the required delay.

• The delays of the clock generation loops should be tested under all PVT corners to ensure

that the delay elements inside the loops are sufficiently large.

• Delay selection signals Si in Figure 6.2 and Figure 6.3 must become valid before the

input clock pulse emerges from the first delay element (k1), and hence, delay k1 must be

sufficiently long.

• The clock pulse width determined by CPW in Figure 6.2 and the pulse width determined

by PW in Figure 6.4 should be tested under all PVT conditions to ensure that the pulse

width requirements of sequential elements are not violated.

• Communication of a variable-clock system with its environment needs special attention to

ensure correct data transfers. The problem of transferring data between unsynchronized

clock domains already exists in many high-speed systems, and many approaches are in use

to minimize metastability and data loss when different clock domains are connected. They

include multi-flop synchronizers, multiplexer recirculation techniques, use of first-in-first-

out buffers between different clock domains and handshake techniques [79, 80]. Similar

synchronization techniques may be applied for inter-chip and intra-chip data transfers

between a VariPipe system and its environment.
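The 90% rule for the clock pulse generator and clock tree delays in the list above can be illustrated numerically. The sketch below is a simplified model in which the loop delay is the sum of one tunable element and one overhead term; the function names and the example values are assumptions made for illustration.

```python
def tuned_delay(target_ns, clock_overhead_ns, counted_fraction=0.9):
    """Delay-element value when only part of the clock overhead is credited."""
    return target_ns - counted_fraction * clock_overhead_ns


def loop_delay(element_ns, clock_overhead_ns):
    """Actual delay around the loop: the element plus the full overhead."""
    return element_ns + clock_overhead_ns


target = 9.60    # required delay for the operation (ns), example value
overhead = 0.76  # clock pulse generator plus clock tree (ns), example value

element = tuned_delay(target, overhead)
actual = loop_delay(element, overhead)
print(f"element = {element:.3f} ns, loop = {actual:.3f} ns")
```

Because only 90% of the overhead is credited, the loop ends up longer than the target by the remaining 10% of the overhead, which serves as the safety margin.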


6.5 Design Flow

Figure 6.6 shows the proposed design flow to implement VariPipe systems using standard cells.

The design flow is explained in detail in the following case study.

6.6 Case Study: VariPipe DLX Microprocessor

To test the performance of a variable-clock synchronous pipeline, a VariPipe version of Hennessy

and Patterson’s 32-bit DLX pipeline microprocessor [77] was implemented in 90nm

technology. The Verilog code of the processor was downloaded from opencores.org [78]. The

DLX core is a RISC microprocessor with five pipeline stages: instruction fetch, instruction

decoding, instruction execution, memory access and write back. To implement the processor,

the design flow of Figure 6.6 was realized using the toolset shown in Table 6.1.

The main synchronous core was constrained to a clock period of 8.73 ns to accommodate the

worst PVT corner. Then, two versions of the processor were generated: one version equipped

with the VariPipe technique and the other a conventional synchronous circuit (fixed-clock).

Both designs were optimized for minimum power and area.

Table 6.1: Toolset

Objective                  Tool             Version
Synthesis                  Design Compiler  Y-2006.06-SP5
Timing and power analysis  PrimeTime-PX     Y-2006.06-SP3-1
Physical design            SoC Encounter    5.2
Simulation                 ModelSim         6.3c

6.6.1 Implementing the VariPipe DLX processor

According to the design flow of Figure 6.6, the first step after obtaining the behavioral HDL

of the DLX processor is to analyze the design to identify the operations of each pipeline stage

and the conditions under which each operation is selected. The execution unit and the decoder

are given here as examples.

Execution unit: Part of the behavioral Verilog code of the execution unit is shown in Fig-

ure 6.7. The execution unit performs a range of tasks including logical and arithmetic opera-


[Flowchart. Boxes recoverable from the figure: HDL design of the synchronous main core and functional verification; analysis of the design to find the operations of each stage and the conditions under which each operation is activated (operation selection tables); synthesis of the main synchronous core with timing/power/area constraints (synthesized HDL code of the synchronous design); finding the delay of each operation (pre-layout delay profiles); simplifying the delay profiles and implementing the clock generation circuit of Figure 6.2 (HDL code and synthesis in the target technology); connecting the clock generator to the main synchronous design in the top-level HDL; IO placement, power planning and floorplanning, including inserting the completion detector of each stage inside that stage and applying set_dont_touch on the clock generator; placement with in-place optimization; pre-CTS optimization; clock tree synthesis; post-CTS optimization; routing with optimization; post-routing optimization; adding fillers and checking the design (connectivity, geometry, antenna); finding the delay of each operation (post-layout delay profiles); simplifying the delay profiles and modifying the clock generation circuit of Figure 6.2 according to the new delay profiles, followed by an ECO; checking timing with STA using the post-layout netlist, SDF and SPEF; if the delays are not OK, tuning the delays and doing an ECO; if they are OK, post-layout simulations, tests and design verification.]

Figure 6.6: Proposed VariPipe design flow. HDL ≡ Hardware Description Language, DRC ≡ Design Rule Check, STA ≡ Static Timing Analysis, ECO ≡ Engineering Change Order, SDF ≡ Standard Delay Format, SPEF ≡ Standard Parasitic Exchange Format, CTS ≡ Clock Tree Synthesis


tions on input registers A and B and places the result into the ALU result register. The results

of these operations are available on the intermediate signals ADD result, AND result and

SUB result. One of these signals is then selected as the output on ALU result based on the

    `define ADD 6'b100000
    `define SUB 6'b100010
    `define AND 6'b100100
    ...
    assign ADD_result = reg_A + reg_B;
    assign SUB_result = reg_A - reg_B;
    assign AND_result = reg_A & reg_B;
    ...
    if (IR_opcode_field == 0) // R-type format inst. or NOP
      case (IR_function_field)
        `ADD: ALU_result <= ADD_result;
        `SUB: ALU_result <= SUB_result;
        `AND: ALU_result <= AND_result;
        ...

Figure 6.7: Verilog code of the Execution unit

instruction opcode field and instruction function field, which are available in the input registers

of the execution unit. Thus, the operation selection table of the execution unit can be derived

as in Table 6.2.

Table 6.2: Operation Selection Table of Execution Unit

             Selection signals (Si)
Operation    IR_opcode_field    IR_function_field
ADD          0                  6’b100000
SUB          0                  6’b100010
AND          0                  6’b100100
...          ...                ...

Decoder: The decoder is responsible for generating the branch signal, which declares that

a branch has to be taken in the next cycle. The decoder also computes the branch address

and sends it to the fetch unit. The result of this computation is needed only if the branch is

to be taken. Therefore, when the branch signal becomes valid and if it is equal to zero, there

is no need to wait for the computation of the branch address to be completed. The operation

selection table of the decoder is shown in Table 6.3.


Table 6.3: Operation Selection Table of Decoder

Operation         Selection signals (Si)
Branch signal     1 (always computed)
Branch address    Branch signal
...               ...

After synthesizing the main core with the design constraints, pre-layout delay profiles of the

pipeline stages are extracted using the STA tool and operation selection tables. Delay profile

extraction and simplification are explained later for post-layout delay profiles as the process is

the same.

To implement the clock generation of Figure 6.2, the operation selection table of each

pipeline stage is used to create the delay selection logic. Delay elements are not fine-tuned at

this stage as the delays will change during layout. Initially, delay elements are selected to be

around 30% larger than needed.

To simplify the implementation of the clock generation circuit, a library of delay elements

in the target technology is created. A delay element is implemented as a chain of 2n inverters,

where n = 1, 2, ..., N. Then, the delay of each delay element is estimated using the STA tool.

The result is a table of several delay elements and their corresponding delay values. The clock

generation circuit is completed using these delay elements.
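A delay library of this kind, and the lookup of a suitable element, might be modelled as follows. The sketch reads the chain length as 2n inverters (an even count, so the element is non-inverting); if chain lengths of a different form are intended, only `build_library` changes. The per-inverter delay is a hypothetical constant, whereas in the flow described here each element's delay is estimated by the STA tool.

```python
INVERTER_DELAY_PS = 50  # hypothetical per-inverter delay in picoseconds


def build_library(max_n):
    """Map the inverter count 2n (n = 1..max_n) to an estimated delay in ps."""
    return {2 * n: 2 * n * INVERTER_DELAY_PS for n in range(1, max_n + 1)}


def pick_element(library, required_ps):
    """Return (inverter count, delay) of the smallest element meeting the need."""
    candidates = [(delay, count) for count, delay in library.items()
                  if delay >= required_ps]
    if not candidates:
        raise ValueError("no element in the library is long enough")
    delay, count = min(candidates)
    return count, delay


library = build_library(max_n=50)
print(pick_element(library, required_ps=1030))
```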

The clock generation circuit is then connected to the main synchronous core in the top-level

HDL used for the layout flow. Most layout steps are similar to the conventional synchronous

design flow [81]. The main difference is that the completion detection circuit of each stage is

constrained to be placed inside that stage.

After the place and route steps are completed, post-layout delay profiles are created. The

list of operations for which delay values are needed is readily available from the operation

selection tables. The delays of various operations are found using static timing analysis (STA).

The compiler’s STA facility enables the designer to obtain the longest delay in any pipeline

stage. However, constructing the delay profiles requires information about the longest path

for each of the operations in the operation selection table. The required information can be

obtained using the STA facility as follows. To find the delay of a given operation of a pipeline


stage, the corresponding selection fields in the input registers of the stage are set to the values

that select that operation, using assign statements in the HDL netlist. These values will in

turn, set the corresponding selection signals for that operation and the delay reported by the

STA tool will be the delay of the desired operation. This process can be automated using an

appropriate script.

As an example, operations ADD, SUB and AND of the execution stage save their re-

sults in the ALU result register. To find the delay of the ADD operation, selection signals

IR opcode field and IR function field are set to 0 and 6’b100000 in the post-layout netlist, as per

the information in Table 6.2. As a result, the delay from the input registers to the ALU result

register reported by the STA tool is the desired delay.
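The scripting step mentioned above could, for example, generate the assign statements from the operation selection table. The Python sketch below is only an illustration: the signal names, the 6-bit field widths and the table layout are assumptions, and the step of invoking the STA tool on the modified netlist is omitted because it depends on the tool.

```python
# Selection fields per operation as (width in bits, value), following the
# entries of Table 6.2. Signal names and widths are assumed for illustration.
SELECTION_TABLE = {
    "ADD": {"IR_opcode_field": (6, 0), "IR_function_field": (6, 0b100000)},
    "SUB": {"IR_opcode_field": (6, 0), "IR_function_field": (6, 0b100010)},
    "AND": {"IR_opcode_field": (6, 0), "IR_function_field": (6, 0b100100)},
}


def assign_statements(operation):
    """Verilog assign statements that force the fields selecting one operation."""
    return [
        f"assign {signal} = {width}'b{value:0{width}b};"
        for signal, (width, value) in SELECTION_TABLE[operation].items()
    ]


for line in assign_statements("ADD"):
    print(line)
```

Running the STA tool once per generated netlist variant then yields one delay per table entry, from which the delay profile of the stage can be assembled.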

The delay profiles of the execution unit and the decoder under the worst PVT corner are

given in Table 6.4, with a 10% margin. These are the values to be matched by the delay

elements. According to the delays in the table, the critical path corresponds to the SUBI

operation in the execution unit. It has a delay of 9.6 ns, including a 10% margin.

Table 6.4: Post-layout delay profiles of Decoder and Execution unit

Decoder
Operation          Delay + 10% margin (ns)
Branch address     9.06
Slot number        6.35
Branch signal      6.07
...                ...
WriteBack index    0.59

Execution Unit
Operation          Delay + 10% margin (ns)
SUBI               9.60
...                ...
SLT                7.60
SRLI               6.91
...                ...
NOP                2.27

The next step is to simplify the delay profiles. The longest delay of the decoder is the branch

address calculation. The branch signal determines if this calculation is needed. Therefore, the

branch signal computation is an unavoidable operation and its corresponding path is the shortest


inevitable path of the decoder, which is 6.07 ns (with the 10% margin). The simplified view

of the clock generation circuit previously given in Figure 6.5 may be used for the clock period

calculation. To calculate the clock period corresponding to the shortest inevitable path of the

decoder, the delay of the path is augmented by the delay of the toggle (0.35 ns), the clock pulse

generator (0.45 ns) and the clock tree (0.31 ns). As a result, the shortest possible clock period

is 7.18 ns. Since this is more than the delay of the other operations of the decoder (except

Branch address computation), the decoder’s delay profile can be simplified to two delays, as

shown in Table 6.5.
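The clock period calculation above can be checked with a few lines of Python. All values are taken from the text and Table 6.4; only the variable names are introduced here.

```python
# Delays in ns, as reported for the decoder in the case study.
branch_signal_path = 6.07     # shortest inevitable path, including the 10% margin
toggle = 0.35
clock_pulse_generator = 0.45
clock_tree = 0.31

shortest_period = round(
    branch_signal_path + toggle + clock_pulse_generator + clock_tree, 2
)

# Other decoder operations listed in Table 6.4, excluding the branch address.
other_decoder_delays = {"Slot number": 6.35, "Branch signal": 6.07,
                        "WriteBack index": 0.59}
all_covered = all(d < shortest_period for d in other_decoder_delays.values())

print(shortest_period, all_covered)
```

Since every listed decoder operation other than the branch address computation completes within this period, a single delay value suffices for all of them.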

Table 6.5: Simplified delay profiles

Decoder unit
Operation               Delay (ns)
Branch address          9.06
All others              7.18

Execution unit
Operation               Delay (ns)
SUBI                    9.60
...                     ...
SLT                     7.60
SRLI, and all others    6.91

The minimum delay of the decoder in Table 6.5 is larger than the minimum delay of all

other stages, making it the shortest inevitable path of the pipeline. Hence, delays less than

7.18 ns in the delay profile of other units were grouped and rounded up to the maximum of the

group. In the case of the execution unit, all delays equal to or less than 6.91 ns were grouped

together, as shown in Table 6.5. The clock generation circuit was modified according to the

new delay profiles, followed by an engineering change order (ECO) pass to update the layout.

The next step according to Figure 6.6 is to fine-tune the delay elements. The physical design

tool is used to write the post-layout HDL netlist along with the standard delay format (SDF)

file and the standard parasitic exchange format (SPEF) file for post-layout STA. The long delay

chains used during synthesis are adjusted by replacing them with other delay elements from

the delay library, then tested by the STA tool to check if they are of appropriate length. This

process is repeated until appropriate delay values are achieved. An ECO pass is done to update

the layout with the modified delays.

It should be noted that the delays were chosen to be longer than needed initially. During


the delay tuning, they were trimmed and then tested using the STA tool. Thus, the ECO

update was used only once at the end of the delay tuning process, followed by a final check of

the delays.

The longest delay for the Si signals of each pipeline stage was compared against the k1

delay (see Figure 6.3) to ensure that the delay selection signals are settled before the input

pulse emerges from the k1 delay element. In our experiments, delay k1 was sufficiently larger

than the delays of the Si signals.

The PVT corners for the 90nm technology used in the experiments are shown in Table 6.6.

Delays were tuned under the worst PVT corner with the 10% margin. Since different paths may

have different levels of sensitivity to PVT variations, delays and data paths were also examined

under the typical and best PVT corners to ensure that the delays are sufficiently large. As

explained in the next section, the VariPipe processor was simulated under all PVT corners to

verify correct functionality.

Table 6.6: PVT corners

PVT corner    Process    Voltage (V)    Temperature
Best          Fast       1.1            -40°C
Typical       Typical    1.0            25°C
Worst         Slow       0.9            125°C

6.7 Evaluation

In this section, different characteristics of VariPipe and fixed-clock processors are compared.

Also, the area and energy overhead of the clock generation circuit are quantified. Function-

ality, performance and energy consumption of the VariPipe DLX processor and its fixed-clock

counterpart were analyzed using the three benchmark suites shown in Table 6.7, which were

compiled by DLX GCC [82]. Post-layout simulations of the circuits were performed for each

benchmark and switching activities were recorded in the switching activity interchange format

(SAIF). These, together with parasitic data (SPEF files), were used by PrimeTime-PX for

simulation-based power analysis.


Table 6.7: Benchmarks

Source                       Benchmark
MiBench [83]                 adpcm coder, adpcm decoder, crc32, dijkstra, qsort
PowerStone [84]              bcnt, blit, compress, ucbqsort
Applications from [85,86]    Bubble Sort, JPEG-DCT, MP3-DCT32, MPEG2-Bdist

6.7.1 Performance analysis

Figure 6.8 shows execution times under best, typical and worst PVT conditions. The perfor-

mance of the fixed-clock system is the same under all conditions, but the performance of the

VariPipe system varies with PVT conditions as shown. Table 6.8 shows the execution time

[Bar chart: execution time (µs), axis from 0 to 10000, for each benchmark; series: VariPipe Best, VariPipe Typ, VariPipe Worst, Fixed-clock]

Figure 6.8: Performance of VariPipe and fixed-clock DLX processors under all PVT corners


Table 6.8: Execution time reduction percentage using VariPipe

Program          Reduction % under worst    Reduction % under typ
adpcm coder      14.0                       51.3
adpcm decoder    13.4                       51.0
crc32            14.5                       51.7
dijkstra         12.8                       50.6
qsort            10.6                       49.4
bcnt             12.4                       50.4
blit             11.4                       50.0
compress         14.1                       51.4
ucbqsort         9.4                        48.7
Bubble Sort      11.0                       49.6
JPEG-DCT         15.9                       52.4
MP3-DCT32        15.0                       51.9
MPEG2-Bdist      15.4                       52.1
Average          12.7                       50.6

reduction percentage obtained using VariPipe. Under the worst-case conditions, the execution

times of benchmarks are 13% shorter, on average, for the VariPipe design, because the VariPipe

system adjusts the clock period in each cycle to match the current operations. The reduction in

execution time varies with the frequency of occurrence of the instructions and their sequence in

the program. The VariPipe system is twice as fast as the fixed-clock counterpart under typical

conditions. The average percentages in the table are calculated by summing the execution times

of all programs.
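The difference between averaging over summed execution times and averaging the per-program percentages can be seen with the typical-corner figures. The sketch below uses the total execution times from Table 6.9 and the per-program reductions from Table 6.8; only the variable names are introduced here.

```python
# Total execution times over all benchmarks, typical PVT corner (µs).
fixed_total_us = 42407.1
varipipe_total_us = 20951.3

# Reduction computed from the summed execution times, as done in the table.
reduction_from_totals = 100.0 * (1.0 - varipipe_total_us / fixed_total_us)

# Per-program reductions under typical conditions, for comparison.
per_program = [51.3, 51.0, 51.7, 50.6, 49.4, 50.4, 50.0,
               51.4, 48.7, 49.6, 52.4, 51.9, 52.1]
mean_of_percentages = sum(per_program) / len(per_program)

print(round(reduction_from_totals, 1), round(mean_of_percentages, 1))
```

Summing the execution times weights each program by its run time, so long-running benchmarks influence the average more than a plain mean of the percentages would allow.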

It should be noted that in the case of deep pipelines with shallow logic in each stage, the

difference between the delay of operations may be negligible. Hence, the gain that can be

achieved with a VariPipe design may be small. In those cases, VariPipe can still improve the

performance by adjusting the clock period to the prevailing PVT conditions. This performance

gain should be carefully compared against the overhead of the clock generation circuit.

6.7.2 Energy consumption analysis

Energy consumption of the two processors under typical PVT conditions is compared in Table 6.9.

Energy values in the table do not include memory and IO. The core energy is the energy

consumption of the processor excluding the clock tree. The VariPipe system consumes only

3% more energy than the fixed-clock system.

Table 6.9: Energy consumption under the typical PVT corner

                          Fixed-clock processor                   VariPipe processor
                              Energy (µJ)          Execution          Energy (µJ)          Execution
Program                       Core  Clock tree    Total  time (µs)      Core  Clock tree    Total  time (µs)

adpcm coder                  2.464       4.741    7.205     1871.1     2.663       4.782    7.445      910.3
adpcm decoder               12.309      21.151   33.460     8333.6    13.012      21.323   34.335     4084.1
crc32                        2.450       4.748    7.198     1875.8     2.697       4.789    7.486      906.7
dijkstra                     1.255       2.380    3.635      940.4     1.324       2.401    3.725      464.1
qsort                        7.829      12.091   19.920     4767.9     8.075      12.195   20.270     2412.5
bcnt                         0.419       0.732    1.151      289.0     0.441       0.738    1.179      143.2
blit                         0.948       1.722    2.670      679.7     1.008       1.737    2.745      340.6
compress                    12.340      25.006   37.346     9864.5    13.596      25.218   38.814     4792.4
ucbqsort                    14.427      22.934   37.361     9039.7    15.224      23.404   38.628     4634.5
Bubble Sort                  0.165       0.215    0.380       84.7     0.168       0.217    0.385       42.7
JPEG-DCT                     5.830       9.769   15.599     3850.5     6.203       9.852   16.055     1831.5
MP3-DCT32                    0.721       1.223    1.944      479.5     0.773       1.232    2.005      230.4
MPEG2-Bdist                  0.461       0.844    1.305      330.7     0.505       0.846    1.351      158.3

Total energy, total time    61.618     107.556  169.174    42407.1    65.689     108.734  174.423    20951.3
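The 3% energy figure can be checked directly from the totals row of Table 6.9. The following Python sketch (names are illustrative, not part of the original flow) computes the relative increase of the core, clock-tree and total energy of the VariPipe processor over the fixed-clock processor:

```python
# Totals from the last row of Table 6.9, in microjoules.
fixed_uj = {"core": 61.618, "clock tree": 107.556, "total": 169.174}
varipipe_uj = {"core": 65.689, "clock tree": 108.734, "total": 174.423}


def overhead_pct(baseline, value):
    """Relative increase of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


overheads = {part: overhead_pct(fixed_uj[part], varipipe_uj[part]) for part in fixed_uj}

for part, pct in overheads.items():
    print(f"{part}: +{pct:.1f}%")
```

The total comes out at about 3.1%, consistent with the 3% quoted in the text; most of the increase is in the core energy rather than in the clock tree.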

6.7.3 Area and energy overhead

The clock generation circuit takes up only 2.6% of the total area of the VariPipe processor.

The area of the clock generation circuit is mainly taken by the delay elements. As previously

mentioned, the energy overhead of using VariPipe is merely 3%. The clock generation circuit

is a very small portion of the total area and is implemented by low-leakage cells and thus, its

leakage power is negligible compared to that of the processor.

6.7.4 Resilience to PVT variations

The VariPipe system has been shown to work correctly under all PVT conditions. It au-

tomatically adjusts to inter-chip and intra-chip PVT variations to deliver the best-possible

performance. Figure 6.8 represents the inter-chip PVT variation analysis of the VariPipe DLX

processor. The execution times drop by as much as 62% from the worst to the best PVT condi-

tions. The VariPipe system is also resilient to intra-chip PVT variations. To test this, starting

from the typical-case SDF file, the delays of the execution unit and its completion detection


circuit were increased by 10% using Design Compiler’s derating commands. The other

units were not changed. A new SDF file was generated to be used in simulations, which verified

that the system executed all the benchmarks correctly.

6.7.5 Reduction in electromagnetic noise

The clock is often the main source of electromagnetic noise in a digital system, because it has

a fixed frequency that is also the highest in the system [90]. Many circuits employ spread-

spectrum oscillators to overcome this problem [91]. In VariPipe systems, the clock frequency

varies within a range around an average value and thus, the clock power is spread over that

range. The clock power spectrum of the VariPipe DLX processor under the worst PVT corner

for one of the benchmarks is compared against its fixed-clock counterpart in Figure 6.9. The

maximum clock power of VariPipe is about 28 dB less than that of the fixed-clock design.

With no central peak in the frequency spectrum, the VariPipe processor should generate less

electromagnetic noise compared to its fixed-clock counterpart.
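To put the 28 dB figure in linear terms, the short Python sketch below converts a decibel difference into a power ratio. It assumes that the vertical axis of Figure 6.9 is a power quantity expressed as 10·log10 of a ratio, which the axis label "Clock power (dB)" suggests but the text does not state explicitly; the function names are illustrative.

```python
import math


def db_to_power_ratio(db):
    """Convert a power difference in decibels to a linear power ratio."""
    return 10.0 ** (db / 10.0)


def power_ratio_to_db(ratio):
    """Inverse conversion: linear power ratio to decibels."""
    return 10.0 * math.log10(ratio)


peak_reduction_db = 28.0  # difference between the two spectral peaks, from the text
ratio = db_to_power_ratio(peak_reduction_db)
print(f"{peak_reduction_db:.0f} dB lower peak = about {ratio:.0f}x less peak power")
# prints "28 dB lower peak = about 631x less peak power"
```

Under that assumption, the highest spectral component of the VariPipe clock carries roughly 1/631 of the power of the fixed-clock peak; the energy is not removed but spread over the range of clock frequencies.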

6.7.6 Suitability for voltage scaling

The 90nm technology used in the experiments is characterized for two supply voltage levels:

1.0 V and 1.2 V. These characterizations were used to apply voltage scaling to the VariPipe

processor. It was ensured that the pads were compatible with 1.2 V and no hold violation

occurred. The system was simulated under typical PVT conditions for both supply voltages.

The system automatically adjusted its speed to changes in supply voltage. It was 1.2 times

faster using the 1.2 V supply, compared to the 1.0 V supply. This shows that the VariPipe

design is amenable to voltage scaling techniques.

6.8 Related Work

As mentioned in Section 6.1, other studies on designing variable-speed pipelines and exploiting

typical PVT conditions have been published. In this section, some of these studies are reviewed

and compared against VariPipe.


[Figure 6.9 panels: (a) VariPipe, (b) Fixed-clock; x-axis: Frequency (Hz), 1 to 10 × 10^8; y-axis: Clock power (dB), −50 to 0]

Figure 6.9: Comparison of the clock power spectra


As mentioned in Section 5.9, variable clocking was first addressed in Dean’s PhD thesis [7].

Dean uses partial duplicates of the functional units in each of the pipeline stages as tracking

cells. As mentioned before, duplicates may introduce a substantial area overhead, which in

turn, increases the power consumption considerably. VariPipe uses variable delays, which have

a significantly smaller overhead compared to the tracking cells in Dean’s method. The case

study showed that the overhead of the added clock generation circuit for a VariPipe DLX

processor is 2.6% in area and 3% in energy consumption. Both the VariPipe DLX processor and

Dean’s design achieve a 2X performance improvement over isochronous design. The design of

tracking cells in Dean’s work is complicated, as they depend on the function being matched

and are implemented at the transistor level. Loads on the transistors of the functional units

are imitated using passive transistors. However, compared to matched delays, Dean’s tracking

cells can better model the delay of the corresponding path. VariPipe employs static timing

analysis in the design of matched delay elements for completion detection circuits. As such, the

design of the completion detection circuit is independent of the function being matched. This

allows the introduction of a simple design methodology that uses conventional design tools to

implement variable-clock systems with standard cells. As a result, the proposed approach can

be readily used in many applications.

Telescopic units are introduced in [89] to design variable-speed pipelines. A fixed clock

period shorter than the delay of the critical path is applied to the pipeline. When the critical

path is triggered, a hold signal is raised to show that another clock cycle is required for the

instruction to complete. Since the critical path of each pipeline stage is not triggered in every

cycle, an overall throughput improvement of 27% has been achieved. In comparison, VariPipe

adjusts to the current instructions in the pipeline as well as present PVT conditions, and hence,

achieves a better performance improvement.
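The difference between the three clocking styles can be illustrated with a deliberately simplified single-stage model. In the Python sketch below, the operation delays, the short clock period and all function names are hypothetical; the model ignores pipelining, completion-detection margins and the overhead of the clock generator, so it only shows the trend, not the reported 27% or 2X figures.

```python
def fixed_clock_time(op_delays, worst_case_period):
    """Every cycle lasts the worst-case period, whatever the operation."""
    return len(op_delays) * worst_case_period


def telescopic_time(op_delays, short_period):
    """Fixed short period; an operation that does not fit raises hold
    and is given a second cycle (assumes no delay exceeds two periods)."""
    cycles = sum(1 if delay <= short_period else 2 for delay in op_delays)
    return cycles * short_period


def variable_clock_time(op_delays):
    """Each clock period is matched to the delay of the current operation."""
    return sum(op_delays)


# Hypothetical per-operation delays in nanoseconds.
op_delays_ns = [2.0, 3.0, 2.0, 5.0, 2.0, 3.0, 5.0, 2.0]

t_fixed = fixed_clock_time(op_delays_ns, max(op_delays_ns))  # 8 cycles * 5.0 ns
t_telescopic = telescopic_time(op_delays_ns, 3.0)            # 10 cycles * 3.0 ns
t_variable = variable_clock_time(op_delays_ns)               # sum of the delays

print(t_fixed, t_telescopic, t_variable)  # prints "40.0 30.0 24.0"
```

In this toy trace the telescopic scheme pays a full extra cycle whenever the long operation occurs, whereas the variable clock pays only the extra delay actually needed.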

TEAtime [73] uses a replica of the critical path to track PVT variations in a DLX-style

processor on FPGA and achieves a 34% speed improvement. The processor does not change its

speed with instructions and it is not resilient to intra-chip PVT variations.

The work presented in [92] pursues the same goals as VariPipe. Epassa, Boyer and Savaria

first find the latencies of instructions in all pipeline stages and save them in memory. These data


are used at run time for variable clocking using a PLL and a variable-period clock synthesizer.

A 17% speedup for one of the test programs has been achieved. This approach is limited by

the number of phases that the PLL can produce. Also, the clock does not adjust to PVT

variations automatically.

The Razor project [8] shows the possibility of reducing the voltage margins used in worst-

case analysis of synchronous circuits. This work reduces power consumption by reducing the

input voltage, but keeps the clock frequency intact. An error recovery circuit is added to cope

with any timing errors due to the reduced voltage.

Asynchronous circuits can also be designed to achieve average-case performance and ad-

justability to PVT variations. Several asynchronous design styles exist, including desynchro-

nization [46], by which a synchronous design is converted into an asynchronous one, and Mouse-

trap [3], which is a methodology to design high-speed pipelines taking advantage of PVT vari-

ability. Desynchronization introduces an area overhead of 13.9% in a DLX microprocessor [46];

the processor does not adjust its speed according to the operations in the pipeline. In a design

approach by Nowick, variable delays are used in the implementation of a speculative completion

detection circuit for an asynchronous adder [93]. VariPipe is simpler than asynchronous design.

It uses standard-cell design implementation and also does not require customized transistor-

level circuits. It uses conventional synchronous design tools and thus, neither asynchronous

design methods nor asynchronous design tools are required.

6.9 Conclusion

This chapter proposed a new clocking scheme in which the clock period tracks the delay of a

pipeline on a cycle-by-cycle basis. A low-overhead clock generation circuit and a standard-cell

design flow compatible with today’s dominant EDA design methodology are demonstrated, which

should enable designers to use the proposed approach in many applications. A case study has

been presented, which demonstrates that the VariPipe DLX processor has a two-fold perfor-

mance advantage over its fixed-clock counterpart. The overhead of the added clock generation

circuit is only 2.6% in area and 3% in energy consumption. Resilience to PVT variations,


reduction in electromagnetic noise, and suitability for voltage scaling are other advantages of

VariPipe systems.

Chapter 7

Conclusion and Future Work

7.1 Conclusion

7.1.1 Summary and Contributions

This dissertation introduces a new methodology to implement synchronous systems with asyn-

chronous advantages. Asynchronous logic is used to generate the clock signal to the main

synchronous core. The resulting system automatically tunes its speed in every cycle to deliver

the best-possible performance under the prevailing PVT conditions. In the case of VariPipe,

the system responds not only to the prevailing PVT conditions but also to the operations cur-

rently taking place in the pipeline. As a result, the performance and power requirements of the

design are no longer defined by the worst-case analysis of the system. Instead, the system is

characterized with a range of deliverable performances and corresponding power requirements.

The main features of the design techniques introduced in this thesis are listed below:

• Only commercial synchronous design tools are used.

• Asynchronous tools and knowledge are not required.

• The overhead in terms of design and circuit control is minimal.

It has been shown that the proposed technique can be used to solve or diminish several

urgent problems in the deep nanometer regime as follows:

• Reduce power consumption and area without deteriorating performance under typical

PVT conditions.

• Improve performance with negligible area and energy consumption overhead.

• Help in handling process variations.

• Expand the design space by introducing new power-performance trade-offs, which are

particularly useful for portable devices.

These advantages make the PVT-aware, self-tuning design a viable solution for today’s

shrinking technologies.

Several other contributions have been presented in this dissertation, which may be summa-

rized as follows:

• In Chapter 3, edge-triggered flip-flops and T-elements are used to improve concurrency

in the operation of asynchronous circuits. It has been shown that by overlapping the

handshakes involved in write-after-read operations, significant speed improvements can

be achieved along with reductions in circuit area and power-delay product. Possible gains

in speed for several configurations are analyzed and experimentally examined.

• A clock generation scheme that accounts for inter-chip and intra-chip variations is pre-

sented in Chapter 4.

• A complete design flow to implement PVT-aware systems using conventional synchronous

design tools is presented in Chapters 5 and 6.

• A low-overhead PVT and operation aware clock generation circuit using only standard

cells is presented in Chapters 5 and 6.

• A technique to implement variable delays with reduced switching power is presented in

Chapter 6, which can be used in the proposed clock generation circuit as well as in

asynchronous systems.

• A new approach to find the delays of different operations of a computational unit using

static timing analysis is presented in Chapter 6.


7.2 Future Work

In this section, different design and computer-aided design (CAD) directions to continue the

work presented in this dissertation are suggested.

In this dissertation, a methodology to implement PVT-aware and VariPipe systems is

introduced and tested on a 32-bit processor. PVT-aware and VariPipe techniques should be

tested on bigger designs and applications. Also, several chips should be fabricated, tested and

compared against the traditional synchronous design. Properties such as performance, power

consumption, yield and electromagnetic noise should be examined under different operating

conditions for many chips.

A future research direction is to fully automate the proposed methodologies. This is espe-

cially important for the VariPipe technique because creating operation selection tables for large

designs may be difficult.

The PVT-aware design technique proposed in this dissertation may be expanded to be

modular for big designs. The system can be divided into multiple subsystems, each implemented

as a PVT-aware self-tuning design and then connected to other subsystems. This results in two

potential advantages: 1) each subsystem works at its own speed, and 2) the need for a global

clock is reduced, which may be useful in the design of large systems on chip (SoCs). This

direction of research shares many goals and challenges with the globally asynchronous locally

synchronous (GALS) design approach.

A future research direction for the VariPipe technique is to employ finer-grained completion

detection techniques. In the approach presented in this thesis, when an operation is

detected, the longest possible delay for that operation is chosen. Using speculative completion

detection [93], it is possible to choose a shorter delay for an operation based on its input data.

For example, if both inputs to an adder are zero, a shorter delay can be chosen to indicate

the completion of addition. Thus, the completion of each operation can be detected more

accurately and cycle reduction can be targeted more aggressively. However, implementing a

speculative completion mechanism for each operation may increase the power consumption and

area of the circuit and complicate the design methodology. Therefore, trade-offs involved in


using speculative completion should be studied carefully.
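As a rough illustration of data-dependent delay selection, the Python sketch below picks one of two matched delays for an adder from the longest carry chain that its operands produce. This is only a behavioural sketch under stated assumptions: the two delay values, the chain-length threshold and the function names are hypothetical, and a real speculative completion circuit would use a cheap detector operating in parallel with the adder rather than an exact bit-by-bit scan.

```python
def longest_carry_chain(a, b, width=32):
    """Longest run of bit positions that a single carry travels through
    when adding a and b (generate followed by consecutive propagates)."""
    longest = chain = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        if a_i & b_i:                # carry generated: a new chain starts here
            chain = 1
        elif (a_i ^ b_i) and chain:  # incoming carry propagates one more position
            chain += 1
        else:                        # carry killed, or nothing to propagate
            chain = 0
        longest = max(longest, chain)
    return longest


SHORT_DELAY_NS = 1.0    # hypothetical matched delay for short carry chains
LONG_DELAY_NS = 3.0     # hypothetical worst-case matched delay
MAX_SHORT_CHAIN = 8     # hypothetical threshold for selecting the short delay


def select_adder_delay(a, b, width=32):
    """Choose the matched delay that signals completion of the addition."""
    if longest_carry_chain(a, b, width) <= MAX_SHORT_CHAIN:
        return SHORT_DELAY_NS
    return LONG_DELAY_NS


print(select_adder_delay(0, 0))           # both inputs zero: short delay
print(select_adder_delay(0xFFFFFFFF, 1))  # carry ripples through all bits: long delay
```

The threshold expresses the trade-off mentioned above: a larger threshold selects the short delay more often but requires that delay to cover longer carry chains.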

Finally, more typical PVT corners are required for the PVT-aware digital design imple-

mentations. In the multi-corner design approach, which has been used for many years with

dominant EDA tools, the best-case PVT corner is used to check hold violations and the worst-

case PVT corner is used to check setup time and specify the clock period. The typical PVT

corner in most technologies is characterized at a temperature of 25 °C, whereas real designs

might be running typically at a higher temperature. Therefore, information about more typical

PVT corners with higher temperatures is needed.

Appendix A

Previous Publications

Parts of this thesis have been previously published, as follows:

• Using Edge-triggering in the Asynchronous Synthesis of Write-after-read Operations, Navid

Toosizadeh and Safwat G. Zaky, In the Proc. of the International Conference on Appli-

cation of Concurrency to System Design (ACSD), Xian, China, pp. 23-28, June 2008.

• Application of Concurrency in the Asynchronous Design of Write-after-read Operations,

Navid Toosizadeh and Safwat G. Zaky, Fundamenta Informaticae Journal, IOS press,

2009.

• VariPipe: Low-overhead Variable-clock Synchronous Pipelines, Navid Toosizadeh, Safwat

G. Zaky and Jianwen Zhu, accepted for publication in the IEEE International Conference

on Computer Design (ICCD), Lake Tahoe, California, October 4-9, 2009.

Appendix B

Balsa Code for Radix-4 Booth Multiplier

import [balsa.types.basic]

type Imm16 is 16 bits
type Imm17 is 17 bits
type Imm3 is 3 bits
type Imm9 is array 0 .. 8 of bit
type Imm1 is array 1 of bits
type Bit9 is 9 signed bits
type A3_t is array 3 of bit

procedure mult_booth (input multiplicand_multiplier: Imm16;
                      output prod: Imm16) is
  variable lostbit : bit
  variable test : bit
  variable iteration : Imm3
  variable product : Imm17
  variable temp : Bit9
  variable case_var: Imm3

  shared add_sub is
  begin
    product := (#product[0..7] @
      ((((#product[8..16] as Bit9) + temp)
      as Bit9) as Imm9) as Imm17) --accumulation statement(1)
  end

begin
  loop
    multiplicand_multiplier -> then
      iteration := 4;
      product := ((#multiplicand_multiplier[0..7] @
        (0b000000000 as Imm9)) as Imm17);
      lostbit := 0;
      loop while iteration > 0 then --loop for 4 iterations.
        -- Depending on the sequence of bits, update the product:
        case ((((lostbit as Imm1) @ #product[0..1]) as A3_t)
              as Imm3) of
          0b001 then
            temp := ((((#multiplicand_multiplier[8..15] @
              (#multiplicand_multiplier[15] as Imm1)) as Imm9))
              as Bit9);
            add_sub()
        | 0b010 then
            temp := ((((#multiplicand_multiplier[8..15] @
              (#multiplicand_multiplier[15] as Imm1)) as Imm9))
              as Bit9);
            add_sub()
        | 0b011 then
            temp := (((((0b0 as Imm1) @
              #multiplicand_multiplier[8..15]) as Imm9))
              as Bit9);
            add_sub()
        | 0b100 then
            temp := -(((((0b0 as Imm1) @
              #multiplicand_multiplier[8..15]) as Imm9))
              as Bit9);
            add_sub()
        | 0b101 then
            temp := -((((#multiplicand_multiplier[8..15] @
              (#multiplicand_multiplier[15] as Imm1)) as Imm9))
              as Bit9);
            add_sub()
        | 0b110 then
            temp := -((((#multiplicand_multiplier[8..15] @
              (#multiplicand_multiplier[15] as Imm1)) as Imm9))
              as Bit9);
            add_sub()
        end;
        --save the second last bit for the next iteration:
        lostbit := #product[1];
        --shift product:
        product := ((#product[2..16] @ (#product[16] as Imm1)
          @ (#product[16] as Imm1)) as Imm17); --accumulation statement(2)
        iteration := (iteration - 1 as Imm3) --accumulation statement(3)
      end;
      prod <- (#product[0..15] as Imm16)
    end
  end
end
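For readers who want to check the arithmetic of the listing above without a Balsa toolchain, the following Python sketch is a behavioural reference model of the same radix-4 Booth scheme. It is not part of the original design flow: the function names are illustrative, and the model passes the two 8-bit operands as separate arguments instead of packing them into one 16-bit input (multiplier in bits 0..7, multiplicand in bits 8..15) as the Balsa procedure does.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


# Radix-4 Booth recoding: (product[1], product[0], lostbit) -> multiple of the multiplicand.
BOOTH_DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_radix4_multiply(multiplicand, multiplier):
    """Multiply two signed 8-bit values with a 17-bit product register."""
    m = to_signed(multiplicand, 8)
    product = multiplier & 0xFF  # bits 0..7: multiplier, bits 8..16: accumulator (zero)
    lostbit = 0
    for _ in range(4):           # four iterations, two multiplier bits each
        code = ((product & 0b11) << 1) | lostbit
        temp = BOOTH_DIGIT[code] * m
        # add_sub: 9-bit signed addition into product[8..16]
        high = (to_signed(product >> 8, 9) + temp) & 0x1FF
        product = (high << 8) | (product & 0xFF)
        lostbit = (product >> 1) & 1                       # save product[1]
        product = (to_signed(product, 17) >> 2) & 0x1FFFF  # arithmetic shift right by 2
    return to_signed(product, 16)                          # prod = product[0..15]


print(booth_radix4_multiply(3, -5))  # prints -15
```

In this sketch the 9-bit accumulator cannot hold −2 × (−128) = +256, so a multiplicand of −128 is outside the range that the model handles correctly; all other signed 8-bit operand pairs can be compared against ordinary integer multiplication.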

Appendix C

Chip Layout of the PVT-aware Processor

The areas highlighted red in Figure C.1 are the submodules of the clock generation circuit.

There are four completion detection circuits in four quadrants of the chip and a clock pulse

generator at the center.

Figure C.1: Chip layout of the PVT-aware processor

Bibliography

[1] A. Davis and S. M. Nowick, “An introduction to asynchronous circuit design,” in mono-

graph on asynchronous design, 1997.

[2] I. E. Sutherland, “Micropipelines,” Communications of the ACM, vol. 32, no. 6, pp. 720–

738, June 1989.

[3] M. Singh and S. M. Nowick, “Mousetrap: High-speed transition-signaling asynchronous

pipelines,” IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 6,

pp. 684–698, June 2007.

[4] N. Karaki, T. Nanmoto, H. Ebihara, S. Utsunomiya, S. Inoue, and T. Shimoda, “A flexible

8b asynchronous microprocessor based on low-temperature poly silicon tft technology,” in

Proc. of the IEEE International Conference on Solid-state Circuits Conference, vol. 1,

2005, pp. 272–598.

[5] L. Necchi, L. Lavagno, D. Pandini, and L. Vanzago, “An ultra-low energy asynchronous

processor for wireless sensor networks,” in Proc. of the 12th IEEE International Symposium

on Asynchronous Circuits and Systems, Mar. 2006, pp. 78–85.

[6] D. Edwards, A. Bardsley, L. Janin, L. Plana, and W. Toms, “Balsa : A tutorial guide,

version v3.5,” in http://intranet.cs.man.ac.uk/apt/projects/tools/balsa/, 2006.

[7] M. E. Dean, “STRiP: A self-timed RISC processor,” Ph.D. dissertation, Stanford Univer-

sity, 1992.

[8] T. Austin, D. Blaauw, T. Mudge, and K. Flautner, “Making typical silicon matter with

Razor,” Computer, vol. 37, no. 3, pp. 57–65, Mar. 2004.

[9] A. Bardsley and D. Edwards, “Compiling the language balsa to delay-insensitive hard-

ware,” in Hardware Description Languages and their Applications (CHDL), Apr. 1997, pp.

89–91.

[10] “HANDSHAKE SOLUTIONS,” in http://www.handshakesolutions.com.

[11] K. van Berkel, Handshake Circuits: an Asynchronous Architecture for VLSI Programming.

Cambridge University Press, 1993.

[12] A. Peeters, “Single-rail handshake circuits,” Ph.D. dissertation, Eindhoven University of

Technology, 1996.

[13] E. Humenay, D. Tarjan, and K. Skadron, “Impact of process variations on multicore per-

formance symmetry,” in Proc. of the Design, Automation and Test in Europe, Apr. 2007,

pp. 1–6.

[14] W. Kuzmicz, E. Piwowarska, A. Pfitzner, and D. Kasprowicz, “Static power consumption in

nano-cmos circuits: Physics and modelling,” in Proc. of the 14th International Conference

Mixed Design of Integrated Circuits and Systems, June 2007, pp. 163–168.

[15] E. Macii, L. Bolzani, A. Calimera, A. Macii, and M. Poncino, “Integrating clock gating

and power gating for combined dynamic and leakage power optimization in digital cmos

circuits,” in Proc. of the 11th EUROMICRO Conference on Digital System Design Archi-

tectures, Methods and Tools, Sep. 2008, pp. 298–303.

[16] H. C. Brearley, “ILLIAC II-a short description and annotated bibliography,”

IEEE Transactions on Computers, no. 3, pp. 399–403, June 1965.

[17] “Arithmetic processor 166 instruction manual.” Digital Equipment Corp., Maynard, MA,

1960.

[18] D. A. Huffman, “The synthesis of sequential switching circuits,” in Sequential Machines:

Selected Papers, E. F. Moore, Ed. Reading, MA: Addison-Wesley, 1964.

[19] D. E. Muller and W. S. Bartky, “A theory of asynchronous circuits,” in Proc. Int. Symp.

Theory of Switching, 1959, pp. 204–243.

[20] T. J. Chaney and C. E. Molnar, “Anomalous behavior of synchronizer and arbiter circuits,”

IEEE Transactions on Computers, no. 4, pp. 421–422, Apr. 1973.

[21] A. J. Martin, “The design of a self-timed circuit for distributed mutual exclusion,” in Proc.

1985 Chapel Hill Conf. VLSI. H. Fuchs, Ed., 1985, pp. 245–260.

[22] A. Bardsley, “Implementing balsa handshake circuits,” Ph.D. dissertation, University of

Manchester, 2000.

[23] A. Bardsley and D. A. Edwards, “The Balsa asynchronous circuit synthesis system,” in

Forum on Design Languages, Sep. 2000.

[24] A. J. Martin et al., “The design of an asynchronous microprocessor,” in Proc. Decennial

Caltech Conf. Advanced Research in VLSI. C. L. Seitz, Ed., 1991, pp. 351–373.

[25] S. B. Furber et al., “A micropipelined ARM,” in Proc. of VII Banff Workshop: Asyn-

chronous Hardware Design, 1993.

[26] A. Takamura et al., “TITAC-2: an asynchronous 32-bit microprocessor based on scalable-

delay-insensitive model,” in IEEE International Conference on Computer Design, 1997, pp.

228–235.

[27] A. J. Martin et al., “The design of an asynchronous mips r3000 processor,” in Proc. 17th

Conf. Advanced Research in VLSI. Los Alamitos, CA: IEEE CS Press, 1997.

[28] A. J. Martin and M. Nystrom, “Asynchronous techniques for system-on-chip design,” Proc.

of the IEEE, vol. 94, no. 6, pp. 1089–1120, June 2006.

[29] S. M. Nowick and M. Singh, “Mousetrap: Ultra-high-speed transition-signaling asyn-

chronous pipelines,” in Computer Design, 2001, pp. 9–17.

[30] R. B. Reese, M. A. Thornton, and C. Traver, “Two-phase micropipeline control wrapper

with early evaluation,” Electronics Letters Online no. 20040256, IEE, no. 3, 2004.

[31] I. E. Sutherland and J. Ebergen, “Computers without clocks,” Scientific American, Aug.

2002.

[32] I. E. Sutherland and S. Fairbanks, “GasP: A minimal FIFO control,” in Proc. International

Symposium on Advanced Research in Asynchronous Circuits and Systems, 2001, pp. 46–53.

[33] W. A. Clark, “Macromodular computer systems,” in Proc. of the Spring Joint Computer

Conference, 1967.

[34] S. M. Burns, “Automated compilation of concurrent programs into self-timed circuits,”

Ph.D. dissertation, California Institute of Technology, 1991.

[35] R. E. Miller, Switching Theory, vol. II: Sequential Circuits and Machines. John Wiley &

Sons, New York, NY, 1965.

[36] A. M. Lines, “Pipelined asynchronous circuits,” 1995.

[37] R. O. Ozdag and P. A. Beerel, “High-speed QDI asynchronous pipelines,” in Proc. Inter-

national Symposium on Advanced Research in Asynchronous Circuits and Systems, 2002,

pp. 13–22.

[38] C. E. Molnar, I. W. Jones, W. S. Coates, J. K. Lexau, S. M. Fairbanks, and I. E. Sutherland,

“Two fifo ring performance experiments,” in Proc. of the IEEE, vol. 87, no. 2, 1999, pp.

297–307.

[39] W. P. Burleson, M. Ciesielski et al., “Wave-pipelining: A tutorial and research survey.”

IEEE Trans. on VLSI Systems, vol. 6, no. 3, pp. 464–474, Sep. 1998.

[40] O. Hauck and S. A. Huss, “Asynchronous wave pipelines for high throughput datapaths,”

in Proc. of the IEEE Conference on Electronics, Circuits, and Systems, vol. 1, 1998, pp.

283–286.

[41] B. D. Winters and M. R. Greenstreet, “A negative-overhead, self timed pipeline,” in Proc.

International Symposium on Advanced Research in Asynchronous Circuits and Systems,

Apr. 2002, pp. 32–41.

[42] D. M. Chapiro, “Globally-asynchronous locally-synchronous systems,” Ph.D. dissertation,

Stanford University, 1984.

[43] F. K. Gurkaynak et al., “GALS at ETH Zurich: Success or failure?” in Proc. of the 12th IEEE

International Symposium on Asynchronous Circuits and Systems, 2006, pp. 150–159.

[44] Philips Semiconductors, “P87cl888; 80c51 ultra low power (ulp) telephony controller.”

[45] D. Caucheteux, E. Beigne, M. Renaudin, and E. Crochon, “AsyncRFID: fully asynchronous

contactless systems, providing high data rates, low power and dynamic adaptation,” in

Proc. of the 12th IEEE International Symposium on Asynchronous Circuits and Systems,

2006, pp. 86–97.

[46] N. Andrikos, L. Lavagno, D. Pandini, and C. P. Sotiriou, “A fully-automated desynchro-

nization flow for synchronous circuits,” in Design Automation Conf., June 2007, pp. 982–

985.

[47] S. M. Nowick, M. E. Dean, D. L. Dill, and M. Horowitz, “The design of a high-performance

cache controller: a case study in asynchronous synthesis,” Integration, the VLSI journal,

vol. 15, no. 3, pp. 241–262, Oct. 1993.

[48] Z. Bing, H. Yong, and Q. Yulin, “An asynchronous data-path design for viterbi decoder,” in

Proc. of the 7th International Conference on Solid State and Integrated Circuits Technology,

vol. 3, 2004, pp. 1645–1648.

[49] L. A. Plana, P. A. Riocreux, W. J. Bainbridge, A. Bardsley, J. D. Garside, and S. Temple,

“SPA - a synthesisable Amulet core for smartcard applications,” in Proc. of the 8th IEEE

International Symposium on Asynchronous Circuits and Systems, 2002, pp. 201–210.

[50] J. Ebergen, A. Chow, B. Coates, J. Schauer, and D. Hopkins, “An asynchronous high-

throughput control circuit for proximity communication,” in 12th IEEE International Sym-

posium on Asynchronous Circuits and Systems, 2006, pp. 23–33.

[51] L. Fesquet and M. Renaudin, “A programmable logic architecture for prototyping clockless

circuits,” in International Conference on Field Programmable Logic and Applications, 2005,

pp. 293–298.

[52] C. Lau, “Asynchronous design FPGA,” 2004.

[53] J. Cortadella et al., “Desynchronization: Synthesis of asynchronous circuits from syn-

chronous specifications,” IEEE Transactions on COMPUTER-AIDED DESIGN of Inte-

grated Circuits and Systems, vol. 25, no. 10, pp. 1904–1921, Oct. 2006.

[54] K. M. Fant and S. A. Brandt, “NULL convention LogicTM: a complete and consistent

logic for asynchronous digital circuit synthesis,” in Proc. of the International Conference

on Application Specific Systems, 1996, pp. 261–273.

[55] M. Ligthart et al., “Asynchronous design using commercial hdl synthesis tools,” in Proc. of

the IEEE 37th Annual 2003 International Carnahan Conference on Security Technology,

1996, pp. 501–507.

[56] A. Kondratyev and K. Lwin, “Design of asynchronous circuits using synchronous cad

tools,” IEEE Design and Test of Computers, vol. 19, no. 4, pp. 107–117, July 2002.

[57] “BALSA,” in http://www.cs.manchester.ac.uk/apt/projects/tools/balsa.

[58] L. A. Plana, S. Taylor, and D. Edwards, “Attacking control overhead to improve synthe-

sised asynchronous circuit performance,” in Proc. of IEEE International Conference on

Computer Design (ICCD), Oct. 2005, pp. 703–710.

[59] T. Yoneda, A. Matsumoto, M. Kato, and C. Myers, “High level synthesis of timed asyn-

chronous circuits,” in Proc. of the 11th IEEE International Symposium on Asynchronous

Circuits and Systems, Mar. 2005, pp. 178–189.

[60] A. Bink and R. York, “Arm996hs: The first licensable, clockless 32-bit processor core,”

IEEE Micro, vol. 27, no. 2, pp. 58–68, March-April 2007.

[61] N. Toosizadeh and S. G. Zaky, “Using edge-triggering in the asynchronous synthesis of

write-after-read operations,” in Proc. of the 8th International Conference on Application

of Concurrency to System Design (ACSD). Xian, China: IEEE Computer Society, June

2008, pp. 23–28.

[62] ——, “Application of concurrency in the asynchronous design of write-after-read opera-

tions,” Fundamenta Informaticae Journal, IOS Press, 2009.

[63] R. Kol and R. Ginosar, “A doubly-latched asynchronous pipeline,” in Proc. of IEEE In-

ternational Conference on Computer Design (ICCD), 1997, pp. 706–711.

[64] M. Greenstreet and K. Steiglitz, “Bubbles can make self-timed pipelines fast,” The Journal

of VLSI Signal Processing, vol. 2, no. 3, pp. 139–148, Nov. 1990.

[65] T. A. Chu, “Synthesis of self-timed control circuits from graphs: An example,” in Proc. of

IEEE International Conference on Computer Design (ICCD), 1986, pp. 565–571.

[66] T. Nanya, Y. Ueno, H. Kagotani, M. Kuwako, and A. Takamura, “Titac: Design of a quasi-

delay-insensitive microprocessor,” IEEE Design and Test of Computers, vol. 11, no. 2, pp.

50–63, Summer 1994.

[67] L. Plana and S. Nowick, “Concurrency-oriented optimization for low-power asynchronous

systems,” in Proc. of International Symposium on Low Power Electronics and Design

(ISLPED), 1996, pp. 151–156.

[68] H. van Gageldonk, “An asynchronous low-power 80c51 microcontroller,” Ph.D. disserta-

tion, Eindhoven University, 1998.

[69] I. Nitta, T. Shibuya, and K. Homma, “Statistical static timing analysis technology,” in

Fujitsu Publications, Oct. 2007.

[70] N. Menezes, “The good, the bad, and the statistical,” in International Symposium on

Physical Design (ISPD), 2007, p. 168.

[71] T. Gemmeke, M. Gansen, H. J. Stockmanns, and T. G. Noll, “Design optimization of

low-power high-performance DSP building blocks,” IEEE Journal of Solid-State Circuits,

vol. 39, no. 7, pp. 1131–1139, July 2004.

[72] S. Borkar, T. Karnik et al., “Parameter variations and impact on circuits and microar-

chitecture,” in Proc. of the ACM/IEEE Design Automation Conference, June 2003, pp.

338–342.

[73] A. K. Uht, “Uniprocessor performance enhancement through adaptive clock frequency

control,” IEEE Trans. on Computer, vol. 54, no. 2, pp. 132–140, Feb. 2005.

[74] K. von Arnim, E. Borinski, and P. Seegebrecht, “Efficiency of body biasing in 90-nm cmos

for low-power digital circuits,” IEEE Journal of Solid-state Circuits, vol. 40, no. 7, pp.

1549–1556, July 2005.

[75] J. W. Tschanz, J. Y. Kao, S. G. Narendra et al., “Adaptive body bias for reducing impacts

of die-to-die and within-die parameter variations on microprocessor frequency and leakage,”

IEEE Journal of Solid-state Circuits, vol. 37, no. 11, pp. 1396–1402, Nov. 2002.

[76] T. D. Burd, A. Pering, A. J. Stratakos, and R. W. Brodersen, “A dynamic voltage scaled

microprocessor system,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, pp.

1571–1580, Nov. 2000.

[77] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach.

Morgan Kaufmann, 2007.

[78] “ASPIDA,” in http://www.opencores.org/projects.cgi/web/aspida.

[79] “Clock domain crossing: Closing the loop on clock domain functional implementation

problems,” in http://w2.cadence.com/whitepapers/cdc wp.pdf. Cadence Design Systems,

Inc., 2004.

[80] A. Lines, “Asynchronous interconnect for synchronous SoC design,” IEEE Micro, vol. 24,

no. 1, pp. 32–41, Feb. 2004.

[81] “ASIC design flow,” in http://www.faraday-tech.com/html/products/asic/Design Flow.html.

[82] “DLX GCC,” in http://www2.ucsc.edu/courses/cmps111-elm/dlx.

[83] “MiBench,” in http://www.eecs.umich.edu/mibench.

[84] L. H. Lee, B. Moyer, and J. Arends, “Instruction fetch energy reduction using loop caches

for embedded applications with small tight loops,” in Proc. of International Symposium

on Low Power Electronics and Design (ISLPED), 1999, pp. 267–269.

[85] B. Gorjiara and D. Gajski, “Automatic architecture refinement techniques for customizing

processing elements,” in Proc. of the 45th ACM/IEEE Design Automation Conf., June

2008, pp. 379–384.

[86] “NISC technology website,” in http://www.cecs.uci.edu/~nisc.

[87] S. Sirowy, W. Yonghui, S. Lonardi, and F. Vahid, “Clock-frequency assignment for multiple

clock domain systems-on-a-chip,” in Proc. of the Design, Automation & Test in Europe

Conference & Exhibition, Apr. 2007, pp. 1–6.

[88] N. Toosizadeh, S. G. Zaky, and J. Zhu, “Varipipe: Low-overhead variable-clock

synchronous pipelines,” in Proc. of the 27th IEEE International Conference on Computer

Design (ICCD), Lake Tahoe, California, Oct. 2009.

[89] L. Benini, E. Macii, and M. Poncino, “Telescopic units: Increasing the average throughput

of pipelined designs by adaptive latency control,” in Design Automation Conf., June 1997,

pp. 22–27.

[90] H. W. Ott, Noise Reduction Techniques in Electronic Systems, 2nd ed. John Wiley &

Sons, 1988.

[91] DALLAS SEMICONDUCTOR, “App note 3512: Spread-spectrum clock oscillators lower

EMI,” 2005.

[92] H. G. Epassa et al., “Implementation of a cycle by cycle variable speed processor,” in Proc.

of ISCAS, May 2005, pp. 3335–3338.

[93] S. M. Nowick, “Design of a low-latency asynchronous adder using speculative completion,”

in IEE Proceedings – Computers and Digital Techniques, Sep. 1996, pp. 301–307.
