Accelerator-Based Architectures for Wireless Sensor Network Applications

A dissertation presented

by

Mark David Hempstead

to

School of Engineering and Applied Science

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in the subject of

Engineering Sciences

Harvard University

Cambridge, Massachusetts

May 2009

© 2009 - Mark David Hempstead

All rights reserved.

Thesis advisors: David Brooks and Gu-Yeon Wei

Author: Mark David Hempstead

Accelerator-Based Architectures for Wireless Sensor Network

Applications

Abstract

Growing power consumption threatens the explosive growth that the semiconductor industry has sustained over the last several decades. While the number of transistors continues to double every process technology generation, the slowing of constant-field scaling has caused power density to increase, limiting clock frequency. To combat these trends, designers must get more performance from each transistor switch. Technology companies are applying microprocessors to a growing diversity of applications that are increasingly mobile and untethered from the power grid. One such domain is the emerging area of wireless sensor networks (WSNs), where, because nodes are often deeply embedded in an environment, power consumption is the primary design constraint.

This dissertation explores the challenges of designing in a power-constrained era through the development of a model we call Navigo and the design and implementation of an accelerator-based architecture for WSNs. We designed Navigo to aid in early architecture exploration as an alternative to the spreadsheets and back-of-the-envelope calculations that planners use to guide future designs. The results show that, even under ideal conditions, multicore processors will not achieve the performance gains necessary to maintain growth. This dissertation shows that if an increasing amount of area per technology node is allocated to specialized accelerators, then microprocessor performance growth will be maintained.

As a case study of accelerator-based architectures, we developed a processor for WSNs. Our architecture includes accelerators for regular tasks, and event handling is offloaded to the event processor, removing the software overhead of a general-purpose design. Because the architecture is modular, VDD-gating can be employed to address leakage current at the architecture level. We built a prototype in 130nm CMOS. We compare our system to other systems in the literature and to a design based on a general-purpose microcontroller. Our system has the lowest energy per equivalent instruction, and the results of our workload analysis show that the system is suited to both low-intensity and high-performance WSN applications.

Contents

Title Page
Abstract
Table of Contents
List of Figures
List of Tables
Citations to Previously Published Work
Acknowledgments
Dedication

1 Introduction and Summary
1.1 Motivation
1.1.1 Technology and Trends
1.1.2 Market Requirements
1.2 Holistic Approach
1.3 Accelerator-based Architectures
1.4 Summary of Contributions

2 Navigo: A Model to Study Power-Constrained Architectures and Specialization
2.1 Navigo: A Model for Performance Trends in Future Technologies
2.1.1 Modeling Methodology and Sample Libraries
2.2 Power-constrained Performance for Multi-core
2.2.1 Results without Power Constraints
2.2.2 Results with Power Constraints
2.3 Validating the Model
2.4 Modeling Specialization
2.4.1 Variant of Amdahl's Law for Specialization
2.4.2 Examples of Specialized Cores
2.5 Model Limitations and Future Directions

3 An Ultra Low Power Event Driven Architecture for WSNs
3.1 Background and Motivation
3.1.1 Overview of WSN Applications
3.1.2 PowerTOSSIM Modeling Commercially Available Systems for WSN
3.1.3 Low-Power Circuit Design Techniques
3.1.4 Energy Scavenging
3.2 Goals of the Architecture
3.3 Architecture Description
3.3.1 System Bus Description
3.3.2 Event Processor Specification
3.3.3 Description of Accelerators and Other Blocks
3.4 Architecture Evaluation
3.4.1 Performance Modeling - SystemC Simulator
3.4.2 Test Application
3.4.3 Cycle Performance Estimates
3.5 Selection of Process Technology
3.5.1 Background on Technology Scaling
3.5.2 Simulation Study
3.5.3 Modeling Architecture Across Process Technologies
3.5.4 Results of System Analysis

4 Silicon Implementation and Evaluation of Accelerator Based Systems
4.1 Implementation Details
4.1.1 Design Flow and Tools Used
4.1.2 VDD-gate circuit
4.1.3 Die-Photo and Test Chip Specifications
4.2 Measurements of Prototype
4.2.1 Test Methodology and Setup
4.2.2 Functional Verification
4.2.3 Block Level Power Measurements
4.2.4 Energy per Task and Energy per Instruction
4.3 Comparison to Related Work
4.3.1 Categorization and Description of Similar Systems
4.3.2 Summary and Comparison
4.4 Comparison to General Purpose Microcontroller
4.4.1 Performance and Energy Benefits of Specialization
4.4.2 Workload Analysis and DVFS
4.5 Using Navigo to Guide Future Revisions

5 Conclusion and Future Directions
5.1 Summary of Themes and Results
5.2 Future Work
5.2.1 Improved Modeling Frameworks
5.2.2 Memory Systems for Accelerator-Based platforms
5.2.3 Applying Accelerator-Based Architectures to Desktop/Mobile platforms

Bibliography

A Related Work: Description of Similar Systems
A.1 General Purpose Commodity Based Systems
A.2 Smart Dust - Early Event Driven
A.3 Subthreshold Systems
A.4 Asynchronous - SNAP
A.5 Charm - Network Stack Acceleration

B Detailed Design Documents
B.1 System Bus Signals
B.2 Memory Map
B.3 Interrupt Map
B.4 Power Domains

List of Figures

1.1 Growth in Microprocessor Performance. Historically the industry has observed a total 1.58x performance gain per year. Power consumption constraints inhibit performance growth, causing a gap between expected and delivered performance. Data from Hennessy and Patterson [25] and spec.org [54].

1.2 Research Approach. We take a holistic approach to research, understanding and addressing power consumption at all layers of the design space. Architecture innovations are informed by modeling and prototyping.

2.1 Graphical depiction of Navigo. The model accepts library files for process technology, circuits, architecture, and market segments, and computes total and constrained power for a set of user-defined inputs such as supply voltage, frequency, etc.

2.2 Results without power constraints across process technologies. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm².

2.3 Results with power constraints across process technologies - Server. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm² and max power of 198 W.

2.4 Results with power constraints across process technologies - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm² and max power of 35 W. Vdd is limited to VddMIN.

2.5 Results with power constraints across process technologies without VddMIN constraints - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm² and max power of 35 W. Vdd can be reduced without a lower limit.

2.6 Validation of Navigo using Microprocessors from 1996 to 2007. Predicted results use the most recent ITRS technology models. The initial core model is an Alpha 21164 0.5 GHz in 250nm technology introduced in 1996. The data points representing commercially available systems are also presented in Figure 2.5.

2.7 Speeding up an application with specialized cores. A workload is split to an additional set of resources, the specialized core. The fraction of the application that can be executed on the specialized core is f, with a speedup of S.

2.8 Understanding the impact of specialization on throughput. Calculations of throughput with specialization for different speedups (S) and fractions of workload (f). Assumes the general purpose core is fully utilized and resources for an additional specialized core have been provisioned.

2.9 Specialization across process technologies with real SP cores. Total throughput for different values of f assuming the area and speedup of one example SP core per GP core. Mobile 35W market segment.

2.10 Configurations that can achieve 1.58x/year throughput. Models two different accelerator structures: the programmable CELL SPE and an H.264 accelerator. Core2Duo-based GP cores and the Mobile 35W market assumed.

3.1 Measured and simulated current consumption for the Beacon application. The simulated version includes a breakdown according to radio, LEDs, and CPU current. A lower resolution digital multi-meter was used for the above measurement, which did not capture the very short duration peak power spikes during the wakeups.

3.2 Surge Application Power Consumption Breakdown. 60 sec of the surge TinyOS application run on the Mica2 mote.

3.3 System Block Diagram.

3.4 Event Processor State Machine.

3.5 Diagram and Code of the Monitoring Application. The code displayed consists of ISR routines written for the event processor. Actual address values have been omitted to make the code easy to read.

3.6 Test Circuit Used for Simulations. The circuit consists of an 11-stage ring oscillator made up of an assortment of logic gates. Interconnect was modeled between devices.

3.7 Leakage Power, EDP, and Frequency Across all Technologies. Each line indicates a technology node from 180nm to 70nm. Supply voltage is on the X-axis, which was swept from 0.1V to the max VDD specific to the process. Temperature is 20 °C and all transistors are minimum size.

3.8 Results for Baseline Architecture. Performance target of N=100 sense and transmit tasks.

3.9 Effect of Energy Reduction Techniques on Total Energy Consumption of the Architecture Across Process Technologies. Power supply voltage is limited to VtP + VtN and the number of tasks per second is 100.

3.10 Summary of Energy Reduction Techniques Across Process Technologies. Each bar represents the minimum energy calculated for a particular architecture configuration and process technology. Both the total energy consumption and a percentage breakdown of the source of energy consumption are included.

4.1 Custom VDD-Gating Circuit. The schematic shows four different parallel legs which are used to control VDD-gating strength. Layout of the filter component shows where the VDD-gating circuit is attached. In this example, the VDD-gating circuit requires an additional area of 3.2%.

4.2 Die Photograph of 130nm Prototype. System includes an event processor and several accelerators for regular operation. The system has been realized in 130nm CMOS on a 2mm x 2mm die. The system contains 444,982 transistors including 4KB of foundry supplied SRAM.

4.3 Frequency versus Voltage Shmoo. The shaded region of the plot indicates where the test failed; the unshaded region indicates successful operation. Results from a full run of the sense and transmit application were used to generate the shmoo. Due to limitations of the test board, the chip was measured up to 12.5 MHz. The shmoo generated using post-layout simulations indicates the chip will work up to 100 MHz.

4.4 Measured power consumption of the prototype under different supply voltages and clock frequencies. Plots a-c show the power consumption for the Event Processor, Accelerator, and SRAM power domains while sweeping voltage from 450 mV to 800 mV and frequency from 25 kHz to 12.5 MHz. Idle power is measured with the external clock off (0 MHz @ 550 mV). The VDD-gating transistor is off (not conducting) during the measurement of gated power.

4.5 Energy per Task of Sense and Transmit Task. Application includes all accelerator blocks and power contributions from the SRAM and Event Processor.

4.6 Comparison to Other Systems Designed for WSN.

4.7 Performance and Power Benefits of Specialization. Test routines were executed both on the hardware accelerators and the microcontroller. Cycle count and energy savings are presented.

4.8 Evaluation of Accelerator-based Architecture vs. General Purpose System.

4.9 WSN architecture projected to advanced process technologies and power budgets. Die size, f, and S are fixed based on measurements of the original system. Area is swept and the configuration with the maximum throughput is reported for three different power budgets.

A.1 Smart Dust Microarchitecture [59].

A.2 Block Diagram of the Subliminal Processor (University of Michigan) [51].

A.3 Simplified block diagram of the SNAP processor for WSN. System includes separate instruction and data memories, a timer coprocessor, and a message processor which provides a FIFO interface to the off-chip radio and sensors [9].

A.4 The Charm protocol processor microarchitecture [52].

List of Tables

2.1 Predicted Process Technology Characteristics. High-Performance Microprocessor Technology ITRS 2007 Edition [50].

2.2 Technology Scaling Factors. High-Performance Microprocessor Logic. Indicates a departure from historical scaling trends resulting in an increase in power density [50].

2.3 Example Cores used in analysis. Data collected from conference and journal publications and datasheets. SPEC2006 results used to determine IPC are from spec.org.

2.4 Market Segment Constraints. Die size and Max Power Consumption for a set of market segments. Values for the first three markets came from ITRS [50]. The final four market segments are based on die size and thermal design point of commercially available Intel Processors.

2.5 Select Microprocessors from 1996 to 2007. Performance data is from the analysis in Figure 1.1. Power consumption and die size data was acquired from datasheets and published microprocessor reports.

2.6 Specialized Cores. Example SP cores used in the model. All measurements were scaled to 65nm technology and speedup was calculated by comparing published performance results to the performance on a general purpose CPU. The Core2 is included to show the relative area and performance cost of including another GP core instead of an SP core. Power and speedup for CELL SPE running Linpack.

3.1 Sensor Sampling Rates of Different Phenomena.

3.2 Example WSN application domains.

3.3 Power model for the Mica2. The mote was measured with the micasb sensor board and a 3V power supply.

3.4 Event Processor Instruction Set.

3.5 Comparison of cycle count for the test application written on our architecture and on TinyOS for the Mica Platform.

3.6 Scaling Factors. From theory and simulation data.

3.7 Activity Ratios for Our Test Application.

B.1 System Bus Signals.

B.2 System Memory Map. All addresses are in hex.

B.3 System Interrupt Map. Lists all of the interrupts in the prototype and the source of the interrupt.

B.4 Power Domains in the Prototype. Lists all of the power domains in the prototype including virtual power domains and power domains for testing only.

Citations to Previously Published Work

The architecture presented in Chapter 3 first appeared in the following paper:

An ultra low power system architecture for sensor network applications, Mark Hempstead, Nikhil Tripathi, Patrick Mauro, Gu-Yeon Wei, and David Brooks, In The 32nd Annual International Symposium on Computer Architecture (ISCA), June 2005.

The PowerTOSSIM simulator, presented in Section 3.1.2 including Figure 3.1, appeared in:

Simulating the Power Consumption of Large Scale Sensor Network Applications, Victor Shnayder, Mark Hempstead, Bor-Rong Chen, Geoff Werner Allen, and Matt Welsh, In Proceedings of the Second ACM Conference on Embedded Networked Sensor Systems (SenSys), Baltimore, MD, Nov 2004.

The evaluation of process technology selection, presented in Section 3.5, appeared in:

Architecture and Circuit Techniques for Low-Throughput, Energy-Constrained Systems Across Technology Generations, Mark Hempstead, Gu-Yeon Wei and David Brooks, In Proceedings of the International Conference On Compilers, Architecture, And Synthesis For Embedded Systems (CASES). Seoul, South Korea. October 2006.

The related work, presented in Section 4.3 and Appendix A, was first surveyed in the following invited paper:

Survey of hardware systems for wireless sensor networks, Mark Hempstead, Michael J. Lyons, David Brooks and Gu-Yeon Wei. ASP Journal of Low Power Electronics, Vol. 4, No. 1, April 2008.

The Navigo model presented in Chapter 2 is currently under submission in the following paper:

Navigo: A Model to Study Power-Constrained Architectures and Specialization, Mark Hempstead, Gu-Yeon Wei, and David Brooks [Under Submission]

The measurement results of our prototype, presented in Chapter 4, are currently under submission:

An accelerator-based wireless sensor network processor in 130nm CMOS, Mark Hempstead, David Brooks, and Gu-Yeon Wei, [In preparation]

Acknowledgments

The path to this PhD has been an adventure, and I would like to take this opportunity to thank all of those who have helped and supported me along the way. Throughout my journey the path was often hard to find and, without the guidance and encouragement from these individuals, I would have never overcome the academic, technical, and emotional challenges that blocked my way.

First, I would like to thank my advisers Gu-Yeon Wei and David Brooks for taking a chance on me to start a fruitful collaboration across the disciplines of circuit design and architecture. Throughout the last few years they have supported and guided my transformation as a researcher. I appreciate the endless hours they spent providing feedback on talks, papers, and chips, pushing me to think more deeply. Early in my research career I received valuable feedback from my qualification committee, Woodward Yang and Paul Horowitz. I am grateful to Margo Seltzer for her instruction in paper writing and presentations in CS261 and, more recently, for agreeing to serve on my dissertation committee.

Throughout the duration of my research project, several individuals helped me with architecture exploration and early Verilog coding, including Nikhil Tripathi, Patrick Mauro, and Xiaoyao Liang. Michael Lyons and I have enjoyed a strong collaboration brainstorming the design of SMASH, a next-generation architecture. I wish to thank the other members of the Mixed-signal VLSI and Architecture groups: Amber Tan, Ruwan Ratnayake, Andrew Liu, Hayun Chung, Ankur Agrawal, Wonyoung Kim, Durlov Khan, Meta Gupta, Benjamin Lee, VJ Reddi, and Kevin Brownell. They provided invaluable instruction and support when I was met with problems using CAD tools, test equipment, and architecture simulators. Moreover, they were the source of supportive conversations at lunch, over dinner, and during late-night tape-outs.

Halfway through my grad student career, our group received the gift of Glenn Holloway, whose management of our machines and debugging support at all hours saved me weeks of frustration. Jim MacArthur in the Cruft circuits lab was an invaluable resource when I needed help designing PCBs, soldering, or finding random parts. Because my research crossed into the systems realm, early collaborations with the wireless sensor network (WSN) group, including Matt Welsh, Geoffrey Werner Challen, Victor Shnayder, and Bor-Rong Chen, helped me understand the needs of the WSN community. I'm thankful to UMC and the SRC for supporting the fabrication of my two test chips. I would like to thank Joel Emer, Mark Charney, and Geoffrey Loweny for hosting me at Intel in Hudson, MA for a summer and exposing me to research in higher performance systems.

For me grad school was more than just research: I had the opportunity to engage in a diverse set of activities, from teaching to graduate student organization and the Harvard house system. Harry Lewis introduced me to his unique course, QR48:BITS, and he was a wonderful teaching mentor who gave me the chance to try my hand at lecturing. Likewise, Woodward Yang showed me how to coach students in engineering design in ES96. I'm thankful to Hwa Chang and Jeffery Hopwood at Tufts for mentoring me after I took over the digital logic class this semester. I would like to encourage the students who have taken over the graduate student life committee to continue the good work of building a community within SEAS and motivating graduate students to leave their labs occasionally. For the past three years, my fellow tutors, masters, and students have made Lowell House into a vibrant and supportive home.

Throughout my graduate school experience, it was the support of my caring friends and family that kept me going. Specifically, I would like to thank my parents, David and Rolande, who brought me up with such caring and supported me with a smile when I turned down a job in the real world for graduate school. My father, who taught me to think like an engineer at a young age through his probing questions at the dinner table, continues to challenge me today. My mother, who rightly believes I need emotional support just as much as technical support, continues to pick me back up after each paper rejection. My sister Amy was my lifeline here in Boston over the past few years. Though she has suppressed her engineering genes, she continues to surprise me with a display of her scientific mind over a bottle of wine. My brother Chris, the more practical engineer, taught me how to put a square peg in a round hole with a big hammer. His thoughtfulness and ingenuity just might convince me to start a company with him ... someday. Finally, I cannot give enough thanks to Megan, whose caring, kindness, and support over the last few years made this dissertation possible and easier to read. I look forward to many more adventures together and one more dissertation between us.

Dedicated to those who have paved the way for me

my parents David and Rolande,

and my grandparents David and Margaret Hempstead, and Rudy and

Lillian Perreault.


Chapter 1

Introduction and Summary

Contents

1.1 Motivation
1.1.1 Technology and Trends
1.1.2 Market Requirements
1.2 Holistic Approach
1.3 Accelerator-based Architectures
1.4 Summary of Contributions

Advances in computational capabilities have driven the information technology revolution, which in turn has driven advances in nearly all fields of science, medicine, and business. Although incredibly powerful computing devices are available today, this single-minded pursuit of performance has made power consumption one of the main bottlenecks for nearly all types of computing systems, from high-end servers to wireless sensor devices. Due to limitations in device cooling at the high end and battery technology at the low end, processor designs are increasingly stratified into power-constrained market segments in which the challenge is to increase processor performance for a fixed power budget. While advanced fabrication technologies are projected to continue to provide computer designers a doubling of transistors per generation, slowing constant-field scaling and worsening wire parasitics will cause the energy per switching event to scale at a rate at which chip power essentially remains constant for a fixed clock frequency and core activity. Current trends towards large multi-core systems use the additional transistor bounty for additional power-efficient cores but, with single-thread performance saturated, most benefits will come through thread-level parallelism. Assuming an optimistic scenario for the continued extraction of thread-level parallelism from workloads, chip performance gains will track growth in transistor counts. The International Technology Roadmap for Semiconductors (ITRS) projects a doubling in the number of transistors every three years (i.e., about 1.25x per year), leading to an increasing gap between projected performance growth and historical performance growth rates. Bridging this performance gap will require an architectural paradigm shift to augment the multi-core trend, in which an increasing fraction of chip real estate must be devoted to specialized logic that provides significant benefits in performance per switching event for a growing portion of workloads.
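The size of this gap can be illustrated with a short calculation. The following Python sketch is only an illustration under stated assumptions: it takes the transistor doubling period of three years and the historical 1.58x per year performance trend quoted in this chapter, and it assumes (optimistically, as above) that performance tracks transistor count exactly. The function names `annual_rate` and `cumulative_gap` are hypothetical helpers and are not part of the Navigo model.

```python
def annual_rate(total_factor: float, years: float) -> float:
    """Convert a growth factor achieved over `years` into a per-year factor."""
    return total_factor ** (1.0 / years)


def cumulative_gap(historical: float, projected: float, years: int) -> float:
    """Ratio of historical-trend performance to projected performance
    after `years` of compounding at the two per-year rates."""
    return (historical / projected) ** years


# Assumption: performance tracks transistor count, which doubles every 3 years.
transistor_rate = annual_rate(2.0, 3.0)
print(f"per-year growth from transistor doubling: {transistor_rate:.2f}x")

# Assumption: the historical trend is the 1.58x per year figure from the text.
HISTORICAL_RATE = 1.58
for years in (3, 6, 9, 12):
    gap = cumulative_gap(HISTORICAL_RATE, transistor_rate, years)
    print(f"after {years:2d} years the shortfall is roughly {gap:.1f}x")
```

The per-year factor computed this way is the cube root of two, which is close to the 1.25x figure quoted above; because both trends compound, even a modest difference in annual rates produces a shortfall that roughly doubles every three years.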

This dissertation argues that maintaining growth in system performance requires using transistors more efficiently to achieve higher performance per watt. The power consumption of a computing device depends on all layers of the design space, from the application software to the system architecture, circuits, and process technology. This work takes a holistic approach by developing models and designs incorporating all layers of the design space. In this chapter, we describe the technology and market conditions that motivate this work and our holistic approach. We also describe accelerator-based architectures in general and briefly introduce a prototype that we designed and taped out for this work. Finally, we summarize the main contributions of this work.

[Figure 1.1 plot: CPU performance (log scale) versus year, 1985 to 2020, showing the historical 1.58x trend and the multi-core and single-thread performance predictions for the power-constrained era.]

Figure 1.1: Growth in Microprocessor Performance. Historically the industry has observed a total 1.58x performance gain per year. Power consumption constraints inhibit performance growth, causing a gap between expected and delivered performance. Data from Hennessy and Patterson [25] and spec.org [54].

1.1 Motivation

1.1.1 Technology and Trends

Over the past few decades the performance of microprocessors has grown steadily.

However, over the past several years designers have been forced to slow the growth


of single thread performance because of increasing power consumption. To explore

these trends, Figure 1.1 plots both historical performance growth and projected multi-

core and single-threaded performance growth until 2020. All data in the plot is

relative to the VAX 11/780 as measured by SPECint benchmarks data in the plot

previous to 2005 was obtained from Hennessy and Patterson, and data for recent years

was obtained using the highest single-die performance SPECint2006 (single-thread)

and SPECint2006rate (multi-core) from the SPEC website [25, 54]. Performance

growth began to deviate from the historical 1.58x per year trend in 2001, primarily

due to the difficulty of obtaining additional clock frequency and instruction-level

parallelism improvements in the face of power constraints. The computing industry

has reacted to this trend by concentrating on multi-core designs that capture thread-level

parallelism. Unfortunately, as detailed in this work, power issues will prevent

multi-core performance growth from meeting the historical trend, and closing this

gap will require more efficient use of transistors.

1.1.2 Market Requirements

The growth of the semiconductor industry has not only been driven by perfor-

mance gains but also by a growing diversity of applications for microprocessors. Mi-

croprocessors have moved out of government and corporate computing centers into

homes, schools, coffee shops, and, now, pockets and pocket books. As microproces-

sors have found additional uses beyond high performance and desktop computing,

new design constraints are being applied to microprocessors, among them power,

size, and cost.


Power consumption is increasingly the primary design constraint for mobile and

embedded devices, as designers try to maximize battery life and reduce cooling cost.

The performance and power consumption requirements across market segments vary

by several orders of magnitude: high-performance servers have a power limit of

200W, while some processors for laptops and netbooks are designed to consume a

maximum of 1-10 W (Chapter 2 includes a more detailed list of market segments and

power constraints). The power constraints imposed by the market conflict with

the increase in power density caused by technology scaling. Because mobile and

embedded devices are untethered from the power grid, power consumption has been

a concern within these communities for some time.

The emerging market segment of wireless sensor networks (WSNs) places even

more stringent power constraints on processor design and therefore is an indicator

of what is to come for the other market segments in the future. Wireless sensor

networks have applications in medicine, science, industrial automation and security.

WSN nodes are often deeply embedded in an environment and decoupled from the

wired power grid. Consequently, designers would like to use scavenged energy to power

WSN devices indefinitely. Currently available energy scavenging methods place a

power consumption constraint of roughly 100 µW on microprocessors designed for

environmentally powered WSNs (a more detailed background of WSNs and energy

scavenging is presented in Section 3.1). These strict limits on power consumption

provide increased design pressure to maximize performance-per-watt. As technology

scales and power density increases, other market segments will face similar design

challenges.


[Figure 1.2 diagrams: (a) Holistic Approach, addressing power consumption at all layers: Application, Network, Architecture, Circuits, Process Tech. (b) Research Cycle, in which architecture is informed by modeling and prototyping: Modeling (Power + Performance), Design (Architecture/Circuits), Circuit Simulations, Prototyping.]

Figure 1.2: Research Approach. We take a holistic approach to research, understanding and addressing power consumption at all layers of the design space. Architecture innovations are informed by modeling and prototyping.

This work investigates the impact of technology scaling on power consumption. As

this section has described, the pressures of a power-constrained era require designers

to think about improving performance per watt by using transistors more efficiently.

This work takes a holistic approach looking at all areas of the design space, using the

emerging domain of WSNs as a case study in ultra-low power design.

1.2 Holistic Approach

During the course of our research, we have taken the view that all layers of the

design space influence power consumption, from the application and network to the

architecture and circuits. Figure 1.2 provides a graphical description of the research

approach we employed. Our research efforts follow an iterative approach through


modeling, design and prototyping and our models incorporate inputs from a variety

of design layers. For example, the PowerTOSSIM model (Section 3.1.2) accepts inputs

from the network and application layers and physical power measurements of nodes

while the Navigo model (Chapter 2) takes data from circuit simulations, process

technology data and performance benchmarks of different architectures.

We use modeling to guide design decisions which are verified by circuit simulations

and prototyping. Chapter 3 describes a design motivated by the modeling of appli-

cation behavior and addresses leakage current, which is increasing due to technology

scaling. Because our power consumption targets are so low, we developed a prototype

in 130nm CMOS to verify that our design achieves ultra-low power operation. Both

the power and performance measurements of the prototype, presented in Chapter 4,

prompt more analysis and modeling of generalized accelerator-based architectures.

Consequently, results from our prototype and modeling efforts will drive our future

research efforts.

1.3 Accelerator-based Architectures

Both the trends in technology and market pressures to increase power efficiency

reveal the need to extract more computation for each transistor switch. Many designers

intuitively believe that application-specific integrated circuits (ASICs) provide

higher performance and increased energy efficiency over general-purpose

designs. However, ASICs are tuned for a particular set of computations and hence do

not possess the flexibility and programmability of a general-purpose processor. One

approach, used by the system-on-chip community, places ASIC accelerators on a chip


with a general purpose microcontroller. As we show in this work, an accelerator-based

approach has the potential to compensate for the loss of performance due to power

constraints. We show that maximizing total system performance requires that the

accelerators provide application speedup (S) for a large fraction of the workload (f).
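For intuition about why both S and f matter, the textbook form of Amdahl's law can be evaluated directly. The sketch below is a minimal Python illustration of that classic formula, not the enhanced variant developed in Chapter 2; the function name and the sample values of S and f are assumptions chosen for illustration.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Classic Amdahl's law: overall speedup when a fraction f of the
    workload is accelerated by a factor s (textbook form, not the
    dissertation's enhanced variant)."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


if __name__ == "__main__":
    # Illustrative values only: a large S helps little when f is small.
    for f, s in [(0.5, 100.0), (0.9, 10.0), (0.99, 100.0)]:
        print(f"f={f:.2f}, S={s:g}: overall speedup {amdahl_speedup(f, s):.2f}x")
```

Under this formula, an accelerator with S = 100 that covers only half of the workload yields less than a 2x overall gain, which is why coverage (f) has to grow along with speedup.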

The regular nature of computation and the ultra-low power requirements of the

WSN application domain make it well-suited to benefit from an accelerator-based

architecture. As a case study of accelerator-based architectures, we designed and

implemented a processor for WSN applications. Our implementation utilizes the

modular nature of the architecture to turn off unused accelerators and address leakage

current with architecture. We also do away with the notion that the system needs

to be controlled by a high powered general purpose core and, instead, we replace it

with an event-driven state machine. Traditionally, the energy efficiency of a system

has been evaluated through the metric of energy-per-instruction. The concept of an

instruction does not map cleanly onto accelerator-based architectures and, therefore, we propose several

new methods to analyze the efficacy of our prototype.

1.4 Summary of Contributions

This work presents the combined contributions of four different modeling and

analysis frameworks and a ground-up silicon implementation of a processor for wire-

less sensor networks. Following the research approach presented in Section 1.2, the

modeling frameworks are informed by several layers of the design space: applications,

architecture, circuits, and process technology. The Navigo model, presented

in Chapter 2, accepts libraries that describe architecture features, process technology


characteristics, voltage and frequency relationships from circuit simulations. Through

the analysis of the inputs, Navigo reports an estimate of performance and power consumption

for future generations of microprocessors. The results reveal that power consumption

increasingly limits performance. Subsequent analysis with Navigo shows

that specialization can provide the necessary performance-per-watt. However, the

high level analysis from Navigo needed to be grounded in a real implementation to

understand the benefits and costs of accelerator-based architectures. The design of

our prototype was informed by our modeling efforts of wireless sensor network applications

with PowerTOSSIM, presented in Section 3.1.2, and an understanding of process

technology trends. Likewise, the architecture of the prototype drives the analysis in

Chapter 4 and the process technology study in Section 3.5. Through the models and

prototype, this work presents the following insights and major contributions.

Navigo: A Model to Study Power-Constrained Architectures and Specialization (Chapter 2)

Modeling Framework for Early Exploration: Currently, designers use intuition

and spreadsheet-based models to explore design decisions and estimate power

consumption and performance of architectures five to fifteen years away from

tape-out. Navigo provides features not available in spreadsheet-based models

including voltage-frequency scaling to meet power constraints and input from

circuit simulations. By incorporating different architecture models, Navigo can

be used to model massive multi-core designs.

Amdahl's Law for Specialization: We enhanced Amdahl's law to model heterogeneous

accelerators that can provide a speedup (S) for a fraction of applications

(f). Including the enhanced Amdahl's law and architecture models of

specialized accelerators, Navigo can be used to compare homogeneous multi-core

designs with designs that include specialized accelerators.

Results Show Increasing Effect of Power Constraints: Results using Navigo

reveal that performance of multi-core systems will be significantly reduced due

to power constraints. While some designers intuitively understand this result,

our work is one of the first quantitative presentations of this issue. This result

should serve as a call to action to develop systems with a higher performance-

per-watt.

Analysis for Amount of Specialization: By including specialized accelerators

in the model, we use Navigo to select the amount of specialization (both S and

f) required to maintain the performance growth shown in the semiconductor

industry. This analysis gives designers the target amount of area to allocate to

specialization in designs over the next decade.

Accelerator-Based Architecture for Wireless Sensor Networks (Chapter 3)

Holistic Design Informed Through Application and Circuits: We built

PowerTOSSIM to study the power consumption of WSN applications. We used

insights gained from PowerTOSSIM to guide our design of the system architec-

ture.

Accelerator-Based Event-Driven Architecture: The custom architecture for

WSNs includes hardware accelerators for regular tasks, offloads event processing

to a custom hardware component (the Event Processor), and addresses

leakage power with architectural support for VDD-gating.

Performance Improvements over Mica2: A SystemC model of the architecture

shows a 10x performance improvement over the Mica2 architecture for typical

WSN tasks.

Framework for Process Technology Selection: We built a framework to evaluate

the selection of process technology. We based the framework on a Verilog

model of the architecture and circuit simulations of different process technology

generations. The results show that because of increasing leakage current, the

most advanced process technology node is not the best choice to minimize total

system power consumption.

Silicon Implementation and Evaluation of Accelerator Based Systems (Chapter 4)

Prototype Chip in 130nm CMOS: We built a prototype as a case study of

accelerator-based architectures. It incorporates synthesized accelerator blocks,

a custom VDD-gating circuit, and 2 KB of SRAM for a total of 444,982 transistors.

Functional Verification and Per Block Power Measurements: We verified that the

prototype functions correctly up to 12.5 MHz at 550

mV. Post layout simulations estimate that the system could run up to 100 MHz

at 1.2V. Measurements of per block power show that VDD-gating reduces idle

leakage power by up to 100x.


New Metric of Energy per Task and Comparison to Related Work: The

traditional metric of energy-per-instruction does not accurately measure an

accelerator-based architecture. Therefore, we introduce two new metrics, Energy

per Task and Energy per Equivalent Instruction, to compare the prototype

to related work. With a measured energy per task of 678.9 pJ and energy per

equivalent instruction of 0.44 pJ, this system is the lowest energy processor currently

available for WSNs.

Analysis of Accelerator Speedup and Energy Savings: We isolate the benefits

of accelerator-based computing by comparing hardware and software implementations

of the routines provided by the accelerators. The results show a 15x to

635x performance speedup and a 10x to 600x energy savings, depending on the

routine.

Comparison to General Purpose Designs through Workload Analysis with Voltage

and Frequency Scaling (VFS): We compare our system against a general

purpose design while sweeping workload intensity. Voltage and frequency scal-

ing and VDD-gating are included in the analysis. The results show that the

architecture is well-suited for low duty cycle applications and at the same time

can provide more performance for high intensity workloads than general purpose

designs.

This work provides both a high-level justification for accelerator-based architec-

tures and a case study built from the ground up. The work concludes with a discus-

sion of some of the open research questions in this area and a description of current

research efforts.

Chapter 2

Navigo: A Model to Study

Power-Constrained Architectures

and Specialization

Contents

2.1 Navigo: A Model for Performance Trends in Future Technologies
2.1.1 Modeling Methodology and Sample Libraries
2.2 Power-constrained Performance for Multi-core
2.2.1 Results without Power Constraints
2.2.2 Results with Power Constraints
2.3 Validating the Model
2.4 Modeling Specialization
2.4.1 Variant of Amdahl's Law for Specialization
2.4.2 Examples of Specialized Cores
2.5 Model Limitations and Future Directions


Given the technology scaling trends and market requirements presented in Sec-

tion 1.1, it is important for chip architects to understand the limitations of homoge-

neous parallelism and to consider more radical architectural approaches. This chapter

presents Navigo, a model that incorporates technology scaling effects to predict future

power-constrained performance trends. Navigo can be used to predict, for a variety

of processor cores, circuit parameters, and market segments, performance trends and

shortfalls from the historical growth rate. Future designs that seek to bridge this gap

must more effectively utilize switching events through specialized hardware. Special-

ization hardware can take many forms [11, 29, 36, 38] including programmable SIMD

units, hardcoded ASIC cores, or reconfigurable logic, and Navigo includes a general

analytical model that can capture the impact of parallel specialization on power-

constrained performance gains. This model projects the amount of specialization,

quantified in terms of several parameters, that will be required in future technology

generations to meet the historical performance scaling trends.

In addressing the problem of power-constrained performance scalability, the chap-

ter makes the following contributions:

We describe Navigo (Section 2.1), a model incorporating technology scaling,

circuit design parameters, and architectural design decisions into a high-level

model to facilitate understanding the impact of power-constrained performance.

We use Navigo to understand a large design space of input parameters (Sec-

tion 2.2).

We extend Navigo to model parallelizable specialization hardware (Section 2.4),


introducing additional parameters to quantify specialization benefits and power/area

costs. This model demonstrates that in order to maintain historical performance

growth, we must increase the amount of specialization for each technology gen-

eration.

2.1 Navigo: A Model for Performance Trends in

Future Technologies

Trends in process technology scaling, predicted by the International Technology

Roadmap for Semiconductors (ITRS), consider a variety of factors that affect the

performance scalability of future computing systems. Designers can no longer rely on

the next technology node to increase circuit performance and reduce energy consump-

tion. Constant-field scaling (or Dennard scaling [64]) has run out with limits imposed

on how aggressively one can reduce transistor threshold voltages (Vth) and supply

voltage (Vdd). The dramatic increase in leakage current has effectively flattened out

Vth scaling for planar CMOS technologies such that supply voltage scaling has also

slowed down. While technology continues to reduce transistor size, wire parasitics

are getting worse after a short respite gained by moving to copper. Lastly, the power

ceiling imposed by cooling costs and battery life further limits performance gains traditionally

offered by technology scaling. In short, the landscape of processor design

has changed dramatically since the end of the twentieth century. It is imperative to

arm future designers with tools that can navigate through the complex interactions of

future process technology scaling trends on architectural and circuit design choices,


[Figure 2.1 diagram: Navigo takes four libraries as input: process technology (ITRS), circuits (HSPICE), architecture (general purpose and specialized cores), and market constraints (Server, Desktop, Mobile, WSN, etc.). User-defined inputs: technology node, Vdd (nominal, min), frequency, number and type of cores, market selection. Outputs: throughput and power.]

Figure 2.1: Graphical depiction of Navigo. The model accepts library files for process technology, circuits, architecture, and market segments, and computes total and constrained power for a set of user-defined inputs such as supply voltage, frequency, etc.

coupled with power budget limitations imposed across different market segments. To

this end, we present Navigo, a detailed model that incorporates the effects of process

technology, circuits, architecture, and market to predict future processor performance

trends.

This section begins with a high-level overview of Navigo, which outlines the basic

goals and assumptions made. Then, it describes the inner details of the model,

revealing how it can be used by designers in early stages of design to help guide

high-level system and architectural design decisions.

Navigo provides designers with a powerful and flexible tool to navigate the in-

tricate tradeoffs between process technology, circuits, and architecture, in order to

predict their implications on performance in future processor designs. Figure 2.1


presents a high-level graphical representation of Navigo. The model takes in a vari-

ety of input libraries, which quantify detailed parameters corresponding to process

technology, circuit performance, architecture, and market segment constraints. While

each of these libraries can be modified by the user, Navigo includes built-in libraries

based on ITRS technology scaling predictions out to 11nm (anticipated in 2022), predictive

technology models (PTM) [47, 67], IPCs of currently available processor cores

(based on SPECint2006 scores), and high-level power and area constraints for differ-

ent market segments. With the libraries in place, the designer can sweep a variety

of input parameters such as technology node, voltage, frequency, target market, etc.

Navigo then outputs the total system throughput and power. The user can then

refine her design by iterating through different input parameters to meet a specific

throughput and/or power target.

2.1.1 Modeling Methodology and Sample Libraries

An engine that takes the various libraries and input sweep parameters to calcu-

late throughput and power consumption is at the core of Navigo. This engine must

consider a variety of factors such as the number and characteristics of computational

blocks (i.e. cores), voltage and frequency scaling, wire loading, leakage power, and

process technology, all constrained by power budget limitations. All of these factors

are quantified by the different library parameters.

The process technology library quantifies several parameters and characteristics

utilized by Navigo, which are listed in Table 2.1. These parameters set the basic

device and wire characteristics that Navigo uses to determine circuit speed, power,


Year of Production                  2007     2010     2013     2016     2019     2022
Device Type                         Planar Bulk (2007-2010), Double Gate (2013-2022)
Approximate node (nm)               65       45       32       22       16       11
Supply Voltage (V)                  1.1      1.0      0.9      0.8      0.7      0.65
Physical Gate Length (nm)           25       18       13       9        6.3      4.5
Id sat (uA/um)                      1211     1807     2204     2627     2768     2786
Intrinsic delay (ps)                0.64     0.46     0.26     0.15     0.1      0.08
Intrinsic switching energy (fJ)     0.064    0.045    0.020    0.0085   0.0037   0.0020
RC delay of 1mm wire (ps)           890      2100     4555     10652    23515    58525
Die Size-Server (mm2)               310      310      310      310      310      310
Number of Transistors (M)           1106     2212     4424     8848     17696    35391

Table 2.1: Predicted Process Technology Characteristics. High-Performance Microprocessor Technology, ITRS 2007 Edition [50].

and the number of cores that will be available in future technology nodes. The built-in

process technology library uses published data from ITRS 2007 [50] out to the

11nm technology node anticipated in year 2022. ITRS predicts double gate technology

will supplant planar bulk devices at the 32nm node in year 2013. ITRS is a

predictive roadmap based on current projections of technology, and

the semiconductor industry has a history of either under- or out-performing it.

For example, Intel's technology roadmap is more aggressive, with processors at the

45nm node already shipping and plans to introduce processors on the 32nm node in

late 2009. Hence, this library can be readily modified by the user to better reflect

updated ITRS projections or proprietary information if available. Table 2.2 compares

technology trends up to 1999 described by Borkar [2] to ITRS 2007 predictions, which

reveals a divergence in power density. This departure from traditional constant-field

scaling affects frequency and voltage scaling in future designs, which we thoroughly

explore in Section 2.2. Throughout the rest of this chapter, we rely on technology


Source                   Transistor   Energy per   Active   Area    Power
                         Delay        Switch       Power            Density
Borkar99 [2]             0.70         0.34         0.49     0.49    1.00
ITRS07 (average) [50]    0.67         0.51         0.76     0.50    1.53

Table 2.2: Technology Scaling Factors. High-Performance Microprocessor Logic. Indicates a departure from historical scaling trends resulting in an increase in power density. [50]

predictions made by ITRS 2007.
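The power density divergence in Table 2.2 can be cross-checked from the other columns. The short Python sketch below does so under two assumptions that are not stated in the text: that frequency scales as the inverse of transistor delay, so active power scales as energy per switch divided by delay, and that power density is active power divided by area. The function and dictionary names are hypothetical.

```python
from typing import Tuple

# Per-generation scaling factors copied from Table 2.2.
FACTORS = {
    "Borkar99": {"delay": 0.70, "energy": 0.34, "area": 0.49},
    "ITRS07 (average)": {"delay": 0.67, "energy": 0.51, "area": 0.50},
}


def derived_factors(delay: float, energy: float, area: float) -> Tuple[float, float]:
    """Return (active power factor, power density factor).

    Assumes frequency scales as 1/delay, so active power scales as
    energy/delay, and that power density is active power over area.
    """
    active_power = energy / delay
    power_density = active_power / area
    return active_power, power_density


if __name__ == "__main__":
    for source, f in FACTORS.items():
        power, density = derived_factors(**f)
        print(f"{source}: active power {power:.2f}x, power density {density:.2f}x")
```

Under these assumptions the derived values agree with the Active Power and Power Density columns to within about 0.01, which suggests the slower scaling of energy per switch is what drives the roughly 1.5x per-generation growth in power density.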

The circuits library utilizes predictive technology models (PTM) [47, 67], available

from the 45nm node down to 16nm, to model how power and frequency scale with

supply voltage and different amounts of wire parasitics. In the absence of detailed cir-

cuit blocks that can be simulated, we rely on HSPICE simulations of fanout-of-4 ring

oscillators across the technologies to determine basic frequency, power, and voltage

trends. We combine ITRS predictions with PTM-based simulations to extrapolate

trends at the 11nm node. These trends allow Navigo to scale voltage and frequency to

meet different power budgets. It is also important to consider the effects of imposing

minimum voltage (VddMIN) constraints since allowing arbitrary reductions in supply

voltage can lead to a variety of issues, including six-transistor SRAM cell instability

[65] and exacerbation of on-chip voltage noise. Again, the circuits library can

be modified by the user to model specific blocks if available.

The architecture library contains a collection of processor cores that the user can

choose to tile together in future multi-core systems. The built-in architecture li-

brary consists of three cores currently in production, listed in Table 2.3. These cores,

Intel Xeon (Netburst), Intel Core2Duo (Core), and Intel Atom, represent high-end

server, desktop, and mobile CPUs. We plan to include analysis for processors such


Processor                    Tech    Die Size   Cores   Vdd     Freq    Power   IPC
                             (nm)    (mm2)              (V)     (GHz)   (W)     (SPEC06/GHz)
Intel Xeon (Tulsa) [18]      65      435        2       1.25    3.4     110     3.72
Intel Core2Duo (Wolfdale)    45      107        2       1.36    3       65      6.82
Intel Atom [12]              45      25         1       1.0     2.0     2.0     2.35

Table 2.3: Example Cores used in analysis. Data collected from conference and journal publications and datasheets. SPEC2006 results used to determine IPC are from spec.org.

as Intel's Core i7, as detailed information becomes available. Parameters for the

processors were obtained from publications and SPEC scores from spec.org for Xeon

and Core2Duo. Since official SPEC results are not available for Atom, we extrapolate

based on benchmark comparisons between Atom and an Athlon with known

SPEC scores [57]. While different processors have been implemented with different

technologies, the power, performance, and area of each core are appropriately scaled by

Navigo utilizing the process technology and circuits trends prescribed by their respective

libraries. The user is not limited to these cores and can also include other

user-defined cores in the architecture library. For example, Section 2.4 explores the

impact of specialized cores.

The market segment library identifies different market segment targets that con-

strain total area and maximum power. Table 2.4 lists examples of different market

segments. Throughout the rest of the chapter, we focus on two particular market

segments: server and mobile. The server market allows for a maximum area of

310mm2 and maximum power of 198W as defined by ITRS. In contrast, the mobile


Market                                   Max Power (W)   Die Area (mm2)
MPU-CP Cost and Performance              151             140
MPU-HP High Performance                  198             310
MPU-PCC Power Cost and Connectivity      3               70
Desktop-95                               95              100
Desktop-65                               65              100
Mobile Standard Voltage                  35              100
Mobile Ultra-low Voltage                 10              100

Table 2.4: Market Segment Constraints. Die size and Max Power Consumption for a set of market segments. Values for the first three markets came from ITRS [50]. The final four market segments are based on die size and thermal design point of commercially available Intel Processors.

market allows for a maximum area of 100mm2 and maximum power of 35W. Again,

different market segments and/or constraints can be easily defined by the user via

changes to the library.
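To make the notion of a user-editable library concrete, the following Python sketch shows one possible in-memory form of the market segment library, populated with the values in Table 2.4. The class, dictionary, and function names are hypothetical; the text does not describe Navigo's actual library file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketSegment:
    """One entry of a market segment library (values from Table 2.4)."""
    name: str
    max_power_w: float
    die_area_mm2: float


# Hypothetical in-memory form of the built-in market segment library.
MARKET_LIBRARY = {
    seg.name: seg
    for seg in [
        MarketSegment("MPU-CP", 151, 140),
        MarketSegment("MPU-HP", 198, 310),
        MarketSegment("MPU-PCC", 3, 70),
        MarketSegment("Desktop-95", 95, 100),
        MarketSegment("Desktop-65", 65, 100),
        MarketSegment("Mobile Standard Voltage", 35, 100),
        MarketSegment("Mobile Ultra-low Voltage", 10, 100),
    ]
}


def power_density_limit(segment: MarketSegment) -> float:
    """Power budget per unit die area, in W/mm^2."""
    return segment.max_power_w / segment.die_area_mm2


if __name__ == "__main__":
    for seg in MARKET_LIBRARY.values():
        print(f"{seg.name}: {power_density_limit(seg):.3f} W/mm^2")
```

Adding or changing a segment then amounts to inserting another entry into the dictionary; expressing each budget as W/mm^2 gives a quick way to compare how tightly each segment constrains power relative to area.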

Finally, Navigo's engine computes total throughput as follows:

\[ \mathrm{Throughput} = N_{cores} \times freq(V_{dd}, tech) \times IPC_{core} \tag{2.1} \]

where the number of cores, Ncores, is defined by the total die size (for a target market

segment) divided by the area of the chosen core, scaled to the technology node. The IPC of each

core can be derived from published (or simulated for new cores) SPEC benchmark

results and clock frequency of the core. Operating frequency depends both on process

technology and voltage, and is calculated based on the original frequency published

for the core. First, Navigo calculates the maximum frequency of the core for nominal

voltage in the new technology. We incorporate both the intrinsic switching delay of the

transistor and effects due to wire delay scaling. We scale logic and wires independently

because the projected trends follow competing directions and are modeled separately


in ITRS.

\[ freq_{VddNom} = freq_{core,\,basetech} \times \left( frac_{logic} \, \frac{freq_{switch,\,tech}}{freq_{switch,\,basetech}} + frac_{wire} \, \frac{freq_{wire,\,tech}}{freq_{wire,\,basetech}} \right) \tag{2.2} \]

where basetech is the original technology in which the core was fabricated. The nom-

inal frequency is then multiplied by PTM-based scaling factors to calculate voltage-

specific frequencies.
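The calculations behind Equations 2.1 and 2.2 can be sketched in a few lines of Python. This is a minimal illustration rather than Navigo's implementation: the function names are hypothetical, Equation 2.2 is read as a weighted sum of logic and wire frequency ratios, frac_wire is assumed to equal 1 - frac_logic, and core area is assumed to halve with each technology generation (the doubling of cores per generation assumed in Section 2.2).

```python
def num_cores(die_area_mm2: float, core_area_mm2: float, generations: int) -> float:
    """Cores that fit in the die budget, assuming core area halves with each
    technology generation. Returns a real number; rounding (or the half-size
    cores the model allows) is left to the caller."""
    return die_area_mm2 / core_area_mm2 * 2 ** generations


def nominal_frequency(freq_base_ghz: float, frac_logic: float,
                      switch_ratio: float, wire_ratio: float) -> float:
    """One reading of Equation 2.2: weight the logic and wire frequency
    ratios (new technology over base technology) by the fraction of the
    cycle attributed to each. Assumes frac_wire = 1 - frac_logic."""
    frac_wire = 1.0 - frac_logic
    return freq_base_ghz * (frac_logic * switch_ratio + frac_wire * wire_ratio)


def throughput(n_cores: float, freq_ghz: float, ipc: float) -> float:
    """Equation 2.1: total throughput in SPEC-score units."""
    return n_cores * freq_ghz * ipc


if __name__ == "__main__":
    # Core2Duo (Table 2.3): 107 mm^2 die with 2 cores at 45nm, 3 GHz, IPC 6.82.
    core_area = 107 / 2
    for gens, node in enumerate(["45nm", "32nm", "22nm", "16nm", "11nm"]):
        print(f"{node}: {num_cores(310, core_area, gens):.1f} cores in 310 mm^2")

    # Switch ratio from the intrinsic delays in Table 2.1 (0.46 ps -> 0.26 ps).
    # frac_logic = 0.8 and wire_ratio = 0.9 are hypothetical placeholders.
    f32 = nominal_frequency(3.0, frac_logic=0.8, switch_ratio=0.46 / 0.26, wire_ratio=0.9)
    n32 = num_cores(310, core_area, 1)
    print(f"32nm: {f32:.2f} GHz nominal, throughput {throughput(n32, f32, 6.82):.0f}")
```

With the Core2Duo parameters, the core count comes out near 6 at 45nm and near 93 at 11nm, in line with the counts reported in Section 2.2.1.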

Power depends on voltage, operating frequency, and the transistor switching rate

of the architecture. We model average power with the following expression:

\[ P_{avg} = P_{active} + P_{leak} \approx freq \times (E_{switch} \times N_{switching} + E_{wire}) + P_{leak} \tag{2.3} \]

Traditionally power consumption is modeled as a sum of active power and leakage

power. Navigo computes active power as the number of transistor

switches per second multiplied by the energy per switch. We calculate switching rate

(Nswitching) from published frequency and power numbers. Since energy per switch

(Eswitch) is technology dependent, it scales based on voltage-dependent scaling factors

derived from HSPICE simulations for each technology node. Wires scale differently

from transistors and, hence, are separately accounted for. We assume leakage power

remains a fixed percentage of the total power consumption at maximum frequency

and nominal voltage, which then scales with respect to different operating voltage

levels. In order to accommodate different power budgets prescribed by different mar-

ket segments, Navigo iterates through voltage and frequency settings until a specific

power target is met. When the model encounters a VddMIN constraint, it scales

only frequency to reduce power, at the expense of energy efficiency.
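The budget-fitting procedure described above can be illustrated with a small Python sketch. It follows the structure of Equation 2.3 (frequency-proportional active power plus leakage) but replaces the HSPICE/PTM-derived scaling factors with placeholder assumptions: frequency tracks Vdd linearly, switching energy scales with Vdd squared, and leakage scales linearly with Vdd. All names and numeric values are hypothetical.

```python
def power_w(vdd: float, freq_ghz: float, nom: dict) -> float:
    """Average power with the structure of Equation 2.3. The voltage
    dependence is a placeholder (energy ~ Vdd^2, leakage ~ Vdd), not
    the PTM-derived scaling used by Navigo."""
    active_nom = nom["power_w"] * (1.0 - nom["leak_frac"])
    leak_nom = nom["power_w"] * nom["leak_frac"]
    v = vdd / nom["vdd"]
    active = active_nom * (freq_ghz / nom["freq_ghz"]) * v ** 2
    return active + leak_nom * v


def fit_to_budget(nom: dict, budget_w: float, vdd_min: float, step: float = 0.01):
    """Lower Vdd (with frequency tracking it linearly) until the budget is
    met; once Vdd cannot drop further without violating vdd_min, lower
    frequency alone."""
    vdd, freq = nom["vdd"], nom["freq_ghz"]
    while power_w(vdd, freq, nom) > budget_w and vdd - step >= vdd_min:
        vdd -= step
        freq = nom["freq_ghz"] * vdd / nom["vdd"]
    if power_w(vdd, freq, nom) > budget_w:
        leak = power_w(vdd, 0.0, nom)  # leakage floor at this voltage
        if leak >= budget_w:
            raise ValueError("budget is below leakage power at vdd_min")
        per_ghz = (power_w(vdd, freq, nom) - leak) / freq
        freq = (budget_w - leak) / per_ghz  # active power is linear in freq
    return vdd, freq


if __name__ == "__main__":
    # Hypothetical chip: 300 W at 1.0 V and 2 GHz, 20% of it leakage.
    nominal = {"vdd": 1.0, "freq_ghz": 2.0, "power_w": 300.0, "leak_frac": 0.2}
    for budget in (198.0, 100.0):
        vdd, freq = fit_to_budget(nominal, budget, vdd_min=0.7)
        print(f"budget {budget:.0f} W: Vdd={vdd:.2f} V, freq={freq:.2f} GHz, "
              f"power={power_w(vdd, freq, nominal):.1f} W")
```

In the second case the voltage floor is reached before the budget is met, so the remaining reduction comes from frequency alone. Leakage does not shrink with frequency, which is why this regime spends more energy per operation.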


While Navigo seeks to combine a variety of factors to accurately predict future

performance, it makes several optimistic assumptions. First, it may not be feasi-

ble to fit an integer number of cores into a predefined area. Hence, we allow for

half-size cores with IPC and power that scale linearly by one half. Although this sce-

nario is infeasible, for near-term technologies (e.g. 45nm), large area cores introduce

quantization effects that make it difficult to observe consistent trends. This effect

becomes significantly less important as we scale to more advanced technologies. Sec-

ond, future multi- and many-core systems will face a variety of challenges to enable

core-to-core communications. Navigo optimistically assumes a perfect on-chip inter-

connection network. Lastly, and perhaps most important, we assume workloads can

be fully parallelized to keep all cores running continuously. Hence, the model is orthogonal

to Hill's investigation that compares single-threaded versus multi-threaded

parallelism [27]. One of the main objectives of developing Navigo was to provide a

detailed and yet flexible model to help designers predict performance trends and guide

future designs. Moreover, we use Navigo to show that, even under the optimistic assumption

of perfectly parallel workloads running on highly parallel many-core designs, power

constraints will hamper performance growth and motivate designers to seek out new

solutions beyond simply increasing the number of cores on a die.

2.2 Power-constrained Performance for Multi-core

Navigo can be used to understand power-constrained performance scalability across

technology generations. In this section, we demonstrate the utility of Navigo by exploring

the scalability of three classes of CPU architectures when considering power-constrained

market segments (Table 2.4) and the impact of the minimum supply voltage

constraint.

For each of these explorations, we make several assumptions. First, we assume

that area and power will be fixed by the market segment. More advanced technology

nodes provide an increase in the number of available transistors leading to a doubling

of available cores per technology generation; however, frequency benefits will be con-

strained by power limits. If the power budget is exceeded for a given number of cores

and clock frequency, we scale voltage and frequency down to meet the power bud-

gets, subject to circuit constraints on the supply voltage, after which linear frequency

scaling is utilized.
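
The scaling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Navigo's actual implementation: it assumes per-core power proportional to V^2 * f with no leakage term, and frequency proportional to V while the supply voltage is above its minimum. The function name `fit_to_power_budget` and all numeric inputs are hypothetical.

```python
def fit_to_power_budget(n_cores, p_core_nom, v_nom, f_nom, v_min, p_budget):
    """Return (vdd, freq, total_power) for a chip that meets p_budget.

    Assumptions (illustrative only, not taken from Navigo):
      - per-core power scales as V^2 * f, with no leakage term
      - frequency scales linearly with V while V >= v_min
      - at v_min only frequency is reduced, so power scales linearly with f
    """
    p_nom = n_cores * p_core_nom
    if p_nom <= p_budget:
        # The design fits the budget at nominal voltage and frequency.
        return v_nom, f_nom, p_nom

    # Joint voltage/frequency scaling: power falls as s^3 for scale factor s.
    s = (p_budget / p_nom) ** (1.0 / 3.0)
    s_floor = v_min / v_nom
    if s >= s_floor:
        return v_nom * s, f_nom * s, p_budget

    # Voltage is pinned at v_min; the rest comes from linear frequency scaling.
    p_at_vmin = p_nom * s_floor ** 3
    f_at_vmin = f_nom * s_floor
    f_scale = p_budget / p_at_vmin
    return v_min, f_at_vmin * f_scale, p_budget


if __name__ == "__main__":
    # Hypothetical 30 W cores at 1.0 V and 3.0 GHz under a 198 W budget.
    for cores in (4, 8, 32):
        print(cores, fit_to_power_budget(cores, 30.0, 1.0, 3.0, 0.7, 198.0))
```

In this sketch, once the voltage floor is reached, every doubling of the core count halves the clock frequency, which is why adding cores beyond that point yields no throughput gain in the idealized case.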

2.2.1 Results without Power Constraints

To understand the impact of power constraints on scaling, we first consider the

scenario where power is not a design constraint. Figure 2.2 illustrates this scenario with

four sub-figures showing various outputs of the model when scaled across technology

nodes for a fixed area budget of 310 mm². The four sub-figures quantify, across

the three core types, the number of cores, clock frequency, total power, and total

chip throughput. Without power constraints, all metrics scale up with technology.

Figure 2.2(a) shows that the number of Core2Duo cores starts at around 6 in the

45nm node (recall that core count is scaled to meet the 310 mm² budget), scaling

to 93 cores by 11nm. Without power limitations, frequency scaling continues un-

abated surpassing 19.12 GHz for the Xeon core in 11nm, but this comes at the price

of increased power dissipation, exceeding a kilowatt in the worst case. Figure 2.2(d)


[Figure 2.2 panels, each plotted against technology node (45 nm, 32 nm, 22 nm, 16 nm, 11 nm) for the Atom, Xeon, and Core 2 cores: (a) Number of cores, (b) Frequency (GHz), (c) Total power (W), (d) Throughput, with ideal 1.58x/year and 1.35x/year (Core2) reference curves.]

Figure 2.2: Results without power constraints across process technologies. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm².


plots total chip throughput relative to the Core2Duo from the 45nm technology node,

as calculated by increasing the core count along with frequency improvement. The

throughput improvement grows at a slightly lower rate than the historical growth

rate of 1.58x per year. This shows that if power is not a constraint, performance growth could

be achieved through a combination of traditional frequency scaling and multi-core

design.

2.2.2 Results with Power Constraints

Incorporating power constraints into our analysis gives a more realistic picture of expected

trends in future technologies. We show that market segments that tolerate higher

power density scale better than more constrained market

segments. In this section, we compare the server market segment, which uses the same

310 mm² die with a power limit of 198W, and the mobile market segment, which uses

a 100 mm² die with a power limit of 35W. Figure 2.3 and Figure 2.4 plot the server

and mobile market segment scalability analysis across the three core types. Each

plot shows the required supply voltage, clock frequency, total power, and total chip

throughput.

Focusing on the results for the server market segment, we observe several impor-

tant trends. For the Intel Xeon design, power is already constrained at the 45nm

technology node, and the design must reduce supply voltage from nominal in order

to meet the power goal. When moving to the 32nm node, the Xeon is able to achieve

a small frequency increase by operating at the minimum supply voltage. Beyond

32nm, the Xeon frequency reduces slightly and then flattens out as the power budget


[Figure 2.3 panels, each plotted against technology node (45 nm, 32 nm, 22 nm, 16 nm, 11 nm) for the Atom, Xeon, and Core 2 cores: (a) Vdd (V), (b) Frequency (GHz), (c) Total power (W), (d) Throughput.]

Figure 2.3: Results with power constraints across process technologies - Server. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm² and max power of 198 W.


[Figure 2.4 panels, each plotted against technology node (45 nm, 32 nm, 22 nm, 16 nm, 11 nm) for the Atom, Xeon, and Core 2 cores: (a) Vdd (V), (b) Frequency (GHz), (c) Total power (W), (d) Throughput.]

Figure 2.4: Results with power constraints across process technologies - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm² and max power of 35 W. Vdd is limited to VddMIN.


is soaked up by additional cores. In contrast, the Intel Core2Duo design allows full

frequency scaling until the 22nm technology node, after which scaling is curtailed;

in 11nm, frequency must be throttled when adding more cores. The Intel Atom

core is much more power-efficient and can continue to scale frequency until 11nm,

with additional power headroom. However, Atom starts with a significant perfor-

mance disadvantage compared to Core2Duo, and hence by 11nm, the Core2Duo and

Atom roughly converge on total throughput. At that point, the best designs (Atom and

Core2Duo) are improving at a rate of 1.35x per year, which by 11nm leaves them nearly 6.6x

below the 1.58x per year curve.
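
The size of this gap follows from compounding the two annual growth rates. As a rough check, the sketch below computes the ratio between a 1.58x per year trend and a 1.35x per year trend. The 12-year span in the example is an assumption chosen for illustration; the text does not state the number of years separating the 45nm and 11nm nodes. The function names are hypothetical.

```python
import math


def growth_gap(ideal_rate, actual_rate, years):
    """Ratio between two compounding growth trends after `years` years."""
    return (ideal_rate / actual_rate) ** years


def years_to_gap(ideal_rate, actual_rate, gap):
    """Years of compounding needed for the ideal trend to lead by `gap`."""
    return math.log(gap) / math.log(ideal_rate / actual_rate)


# Assumed 12-year span (illustrative, not stated in the text).
gap_12 = growth_gap(1.58, 1.35, 12)
span = years_to_gap(1.58, 1.35, 6.6)
print(f"gap after 12 years: {gap_12:.2f}x")
print(f"years to reach a 6.6x gap: {span:.1f}")
```

Under these assumptions, a shortfall of about 6.6x corresponds to roughly twelve years of growing at 1.35x instead of 1.58x per year.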

The mobile market segment, seen in Figure 2.4, exhibits similar trends, but the

tighter power constraints result in more severe reductions in clock frequency and

slower overall per-year throughput growth. For example, the Core2Duo hits a

frequency cap around 32nm, and frequency flatlines until 16nm, when it dips slightly.

Even the Atom processor reaches the power cap at 16nm, after which frequency also dips to

maintain the power budget.

An important issue that we see repeatedly throughout the above scenarios is that the

minimum Vdd constraint is reached as we seek to fit designs with many cores into fixed

power budgets by reducing voltage and clock frequency. When a design reaches

this constraint, additional power reduction can only be achieved through inefficient

frequency scaling: an essentially linear reduction in clock frequency offsets the additional

cores. Practically speaking, designers may prefer to simply stop scaling the number

of cores in a system at this point. In order to understand this effect, we have

run additional simulations with the constraint removed; the results are shown in


[Figure 2.5 panels, each plotted against technology node (45 nm, 32 nm, 22 nm, 16 nm, 11 nm) for the Atom, Xeon, and Core 2 cores: (a) Vdd (V), (b) Frequency (GHz), (c) Total power (W), (d) Throughput.]

Figure 2.5: Results with power constraints across process technologies without VddMIN constraints - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm² and max power of 35 W. Vdd can be reduced without a lower limit.


Figure 2.5. We significantly reduce VDD to meet the power constraints set by the

market, to as low as 0.6V for the Xeon and Core2Duo microarchitectures in advanced

technologies. There is a clear loss in throughput for systems under minimum

VDD constraints, Figure 2.4 (d), compared to systems without minimum VDD

constraints, Figure 2.5 (d). For the Atom processor, minimum Vdd is not a severe issue:

for the mobile market segment in the 11nm node, the constraint reduces its throughput

by 13.4%. However, the minimum voltage constraint reduces the throughput of the

Xeon core by 57.6% for the same target. Even without this constraint, the Xeon still

performs poorly compared to the more power-efficient cores, because running at very

low voltage does not provide ideal performance.

2.3 Validating the Model

This section presents a back-validation of Navigo for microprocessors built from

1996 to 2007. Because of the predictive nature of the model, it is difficult to validate

Navigo's predictions of the power and performance of microprocessors built using

future process technologies. Therefore, we validate Navigo, seeded with an initial

data point from 1996, against microprocessors manufactured over the last 10 years.

For validation, we seeded the microarchitecture library with the DEC Alpha 21164

microprocessor, introduced in 1996 and manufactured in 350nm technology. We de-

veloped the technology and circuits library based on ITRS data from 1997 to 2007

and circuit simulation results, using SPICE models from industry and PTM. In 1997,

the ITRS committee did not anticipate the growth in power density that started with

the 180nm technology node. Therefore, for each node, we chose the technology model


CPU             Year  Node (nm)  Die Size (mm²)  Throughput  Freq (GHz)  Power (W)
Alpha 21164     1996  350        210             481         0.5         31
Alpha 21164     1997  350        141             649         0.6         40
Alpha 21264     1998  350        314             993         0.6         73
Alpha 21264A    1999  250        210             1267        0.7         85
Pentium III     2000  180        106             1779        1.0         29
Athlon          2001  180        130             2584        1.6         68
Pentium 4       2002  130        146             4195        3.0         81.8
Opteron         2003  130        193             5364        2.2         89
Xeon            2004  130        237             5764        3.6         92
64-bit Xeon     2005  90         81              6505        3.6         110
Core 2 Extreme  2006  65         143             17909       2.93        75
Xeon 3085       2007  65         143             23207       3           65
POWER6          2007  65         341             35071       4.7         180

Table 2.5: Select Microprocessors from 1996 to 2007. Performance data is from the analysis in Figure 1.1. Power consumption and die size data was acquired from datasheets and published microprocessor reports.

from the ITRS year closest to the date of introduction. This technique isolates the error

in ITRS predictions from the modeling framework.

We compare predictions from Navigo with microprocessors manufactured between

1996 and 2007, shown in Table 2.5. We calculate throughput from the same Hennessy

and Patterson and SPECint2006 benchmark data used to develop Figure 1.1,

described in Section 1.1.1. We gathered power consumption data from datasheets

and online microprocessor reports. The die sizes of the microprocessors vary widely;

therefore, we compare throughput per unit area and power per unit area.
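
The per-area normalization can be illustrated with a few rows copied from Table 2.5. This is a minimal sketch, assuming the die sizes in the table are areas in mm² and expressing power density in mW/mm²; the helper name `per_area` and the choice of rows are illustrative only.

```python
# (name, year, die_size_mm2, throughput, power_w), values copied from Table 2.5.
# Assumption: the die size column is an area in mm^2.
CPUS = [
    ("Alpha 21164", 1996, 210, 481, 31),
    ("Pentium 4", 2002, 146, 4195, 81.8),
    ("Core 2 Extreme", 2006, 143, 17909, 75),
    ("POWER6", 2007, 341, 35071, 180),
]


def per_area(die_mm2, throughput, power_w):
    """Return (throughput per mm^2, power density in mW/mm^2)."""
    return throughput / die_mm2, 1000.0 * power_w / die_mm2


densities = {
    (name, year): per_area(area, tput, power)
    for name, year, area, tput, power in CPUS
}

for (name, year), (tput_density, power_density) in densities.items():
    print(f"{name:15s} {year}  {tput_density:8.2f} throughput/mm^2"
          f"  {power_density:7.1f} mW/mm^2")
```

Normalizing by area in this way lets a 143 mm² die and a 341 mm² die be compared on the same footing.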

Figure 2.6 (a) presents a comparison of throughput per unit area predicted with

Navigo and the throughput of commercially available microprocessors. The x-axis

represents both technology node and year of introduction. The throughput of the


[Figure 2.6 panels, each plotted against technology node and year of introduction (350nm 1997, 250nm 1999, 180nm 2000, 130nm 2002, 90nm 2005, 65nm 2006) for Navigo and commercial microprocessors: (a) Throughput/Area, (b) Power/Area (mW/mm²).]

Figure 2.6: Validation of Navigo using Microprocessors from 1996 to 2007. Predicted results use the most recent ITRS technology models. The initial core model is an Alpha 21164 0.5 GHz in 350nm technology introduced in 1996. The data points representing commercially available systems are also presented in Table 2.5.

initial core, Alpha 21164 0.5 GHz, matches the predictions from Navigo, which indicates

the absence of static offset errors in the model. The throughput predicted by Navigo

aligns well with the results from the benchmarked microprocessors. Generally, Navigo

estimates the upper bound of throughput per unit area. To combat increasing power

consumption, designers of microprocessors in the 65nm node slowed the scaling of

clock frequency and chose to design multi-core processors made of simpler cores.

Navigo overestimates the throughput of multi-core designs because it assumes that

the costs of communication and thread synchronization are zero.

While Navigo predicts a general trend of increased power density, shown in Fig-

ure 2.6 (b), it does not predict the drastic jump in power consumption caused by

changes in microarchitecture, as it assumes a fixed core design. During the period be-


tween 1997 and 2005, microarchitects aggressively pursued single-thread performance

resulting in several high-throughput and high-power consumption designs. The deeply

pipelined Netburst microarchitecture, manufactured in 130nm (Pentium 4 and Xeon),

had notoriously high power consumption. Subsequently, the industry changed course

and introduced more power efficient multi-core designs. The power consumption

predicted by Navigo matches the initial core Alpha 21164 in 350nm. Navigo also aligns

well with the multi-core designs in the 65nm node, which utilize cores that have

microarchitectures similar to the Alpha. The model correctly shows the transition between an

earlier era, when constant field scaling was still possible and power density remained

constant (350nm, 250nm, 180nm), and the current era of increasing power density.

Our back-validation shows that Navigo predicts throughput well and points out

general trends in power consumption. Navigo incorporates a static model of mi-

croarchitecture, and thus for a more accurate prediction of power consumption, users

should include cores in their libraries which best represent their target core design.

2.4 Modeling Specialization

Consistent progress towards smaller, faster, and more numerous transistors with

each generation of process technology no longer yields the steady growth in comput-

ing performance enjoyed throughout the 20th century. The power ceiling forced a

right-hand turn in single-thread performance and CPU designers have been rac-

ing to implement multi-core systems ever since. Unfortunately, Navigo predicts that

even for the server market segment, multi-core scaling will only yield a 1.35x/year

performance growth trend. In order to get back onto the 1.58x/year growth trend,

designers must maximize the efficiency of transistor (and wire) switching. In other words,

designers must minimize the overheads associated with a general-purpose (GP) CPU.

One obvious direction is to replace general-purpose computing with dedicated, spe-

cialized hardware that offers higher computation per unit area and power, for an

increasing fraction of the machine's workload. IBM's CELL processor is one such

example. It includes 8 SPEs, which are specialized cores used to speed up SIMD

workloads [11]. Similarly, graphics processing units (GPUs) have been used extensively

by programmers to speed up tasks related to video processing and other SIMD

operations. Another example may be to introduce dedicated hardware specialized for

H.264 decoding. In order to understand the potential benefits of specialization, this

section introduces a parallel variant of Amdahl's Law for specialization. Then, by

augmenting Navigo with specialization, we project the amount of specialization that

will be required in future computing systems to increase system throughput by 1.58x

per year.

2.4.1 Variant of Amdahl's Law for Specialization

Amdahl's Law is commonly used to describe the theoretical limitations of application

speedup given constraints on the fraction of the workload that can be sped

up.

$$\mathrm{Speedup}_{\mathrm{enhanced}}(f, S) = \frac{1}{(1 - f) + \frac{f}{S}} \qquad (2.4)$$

where f is the fraction of the workload that can be enhanced and S is the amount of

speedup possible through enhancements. Amdahl's Law has been adapted to model

symmetric and asymmetric multi-core systems [27], where parallel cores can execute


[Figure 2.7 panels: (a) Calculation framework, (b) Throughput (normalized) vs fraction of workload (f) and Speedup (S).]

Figure 2.7: Speeding up an application with specialized cores. A workload is split to an additional set of resources, the specialized core. The fraction of the application that can be executed on the specialized core is f, with a speedup of S.

all workloads. With specialized cores, we must make a few assumptions in order to

model speedup using Amdahl's Law. First, we assume special-purpose (SP) cores

can only run specific parts of an application (f) while general-purpose cores can run

the entire workload, albeit with lower efficiency. Second, we optimistically assume

that workloads are arbitrarily parallelizable (also previously assumed in Navigo). T