Accelerator-Based Architectures for Wireless Sensor Network Applications
A dissertation presented
by
Mark David Hempstead
to
School of Engineering and Applied Science
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the subject of
Engineering Sciences
Harvard University
Cambridge, Massachusetts
May 2009

© 2009 - Mark David Hempstead
All rights reserved.

Thesis advisors: David Brooks and Gu-Yeon Wei
Author: Mark David Hempstead

Accelerator-Based Architectures for Wireless Sensor Network Applications

Abstract
Growing power consumption threatens the explosive growth that the semiconductor
industry has sustained over the last several decades. While the number of transistors
continues to double every process technology generation, the slowing of constant field
scaling has caused power density to increase, limiting clock frequency. To combat these
trends, designers must get more performance from each transistor switch. Technology
companies are applying microprocessors to a growing diversity of applications that
are increasingly mobile and untethered from the power grid. One such domain is
the emerging area of wireless sensor networks (WSNs) where, because nodes are
often deeply embedded in an environment, power consumption is the primary design
constraint.
This dissertation explores the challenges of designing in a power-constrained era
through the development of a model we call Navigo and the design and implemen-
tation of an accelerator-based architecture for WSNs. We designed Navigo to aid in
early architecture exploration as an alternative to the spreadsheets and back-of-the-
envelope calculations that planners use to guide future designs. The results show
that, even under ideal conditions, multicore processors will not achieve the performance
gains necessary to maintain growth. This dissertation shows that if an increasing
amount of area per technology node is allocated to specialized accelerators, then
microprocessor performance growth will be maintained.
As a case study of accelerator-based architectures, we developed a processor for
WSNs. Our architecture includes accelerators for regular tasks, and event handling is
offloaded to an event processor, removing the software overhead of a general purpose
design. Because the architecture is modular, VDD-gating can be employed to address
leakage current at the architecture level. We built a prototype in 130nm CMOS. We
compare our system to other systems in the literature and to a general-purpose
design. Our system has the lowest energy per equivalent instruction, and the results of
our workload analysis show that the system is suited to both low-intensity and
high-performance WSN applications.

Contents
Title Page
Abstract
Table of Contents
List of Figures
List of Tables
Citations to Previously Published Work
Acknowledgments
Dedication

1 Introduction and Summary
    1.1 Motivation
        1.1.1 Technology and Trends
        1.1.2 Market Requirements
    1.2 Holistic Approach
    1.3 Accelerator-based Architectures
    1.4 Summary of Contributions

2 Navigo: A Model to Study Power-Constrained Architectures and Specialization
    2.1 Navigo: A Model for Performance Trends in Future Technologies
        2.1.1 Modeling Methodology and Sample Libraries
    2.2 Power-constrained Performance for Multi-core
        2.2.1 Results without Power Constraints
        2.2.2 Results with Power Constraints
    2.3 Validating the Model
    2.4 Modeling Specialization
        2.4.1 Variant of Amdahl's Law for Specialization
        2.4.2 Examples of Specialized Cores
    2.5 Model Limitations and Future Directions

3 An Ultra Low Power Event Driven Architecture for WSNs
    3.1 Background and Motivation
        3.1.1 Overview of WSN Applications
        3.1.2 PowerTOSSIM: Modeling Commercially Available Systems for WSN
        3.1.3 Low-Power Circuit Design Techniques
        3.1.4 Energy Scavenging
    3.2 Goals of the Architecture
    3.3 Architecture Description
        3.3.1 System Bus Description
        3.3.2 Event Processor Specification
        3.3.3 Description of Accelerators and Other Blocks
    3.4 Architecture Evaluation
        3.4.1 Performance Modeling - SystemC Simulator
        3.4.2 Test Application
        3.4.3 Cycle Performance Estimates
    3.5 Selection of Process Technology
        3.5.1 Background on Technology Scaling
        3.5.2 Simulation Study
        3.5.3 Modeling Architecture Across Process Technologies
        3.5.4 Results of System Analysis

4 Silicon Implementation and Evaluation of Accelerator Based Systems
    4.1 Implementation Details
        4.1.1 Design Flow and Tools Used
        4.1.2 VDD-gate circuit
        4.1.3 Die-Photo and Test Chip Specifications
    4.2 Measurements of Prototype
        4.2.1 Test Methodology and Setup
        4.2.2 Functional Verification
        4.2.3 Block Level Power Measurements
        4.2.4 Energy per Task and Energy per Instruction
    4.3 Comparison to Related Work
        4.3.1 Categorization and Description of Similar Systems
        4.3.2 Summary and Comparison
    4.4 Comparison to General Purpose Microcontroller
        4.4.1 Performance and Energy Benefits of Specialization
        4.4.2 Workload Analysis and DVFS
    4.5 Using Navigo to Guide Future Revisions

5 Conclusion and Future Directions
    5.1 Summary of Themes and Results
    5.2 Future Work
        5.2.1 Improved Modeling Frameworks
        5.2.2 Memory Systems for Accelerator-Based platforms
        5.2.3 Applying Accelerator-Based Architectures to Desktop/Mobile platforms

Bibliography

A Related Work: Description of Similar Systems
    A.1 General Purpose Commodity Based Systems
    A.2 Smart Dust - Early Event Driven
    A.3 Subthreshold Systems
    A.4 Asynchronous - SNAP
    A.5 Charm - Network Stack Acceleration

B Detailed Design Documents
    B.1 System Bus Signals
    B.2 Memory Map
    B.3 Interrupt Map
    B.4 Power Domains

List of Figures
1.1 Growth in Microprocessor Performance. Historically the industry has observed a total 1.58x performance gain per year. Power consumption constraints inhibit performance growth, causing a gap between expected and delivered performance. Data from Hennessy and Patterson [25] and spec.org [54].
1.2 Research Approach. We take a holistic approach to research, understanding and addressing power consumption at all layers of the design space. Architecture innovations are informed by modeling and prototyping.
2.1 Graphical depiction of Navigo. The model accepts library files for process technology, circuits, architecture, and market segments, and computes total and constrained power for a set of user-defined inputs such as supply voltage, frequency, etc.
2.2 Results without power constraints across process technologies. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm².
2.3 Results with power constraints across process technologies - Server. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm² and max power of 198 W.
2.4 Results with power constraints across process technologies - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm² and max power of 35 W. Vdd is limited to VddMIN.
2.5 Results with power constraints across process technologies without VddMIN constraints - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm² and max power of 35 W. Vdd can be reduced without a lower limit.
2.6 Validation of Navigo using Microprocessors from 1996 to 2007. Predicted results use the most recent ITRS technology models. The initial core model is an Alpha 21164 0.5 GHz in 250nm technology introduced in 1996. The data points representing commercially available systems are also presented in Figure 2.5.
2.7 Speeding up an application with specialized cores. A workload is split to an additional set of resources, the specialized core. The fraction of the application that can be executed on the specialized core is f, with a speedup of S.
2.8 Understanding the impact of specialization on throughput. Calculations of throughput with specialization for different speedups (S) and fractions of workload (f). Assumes the general purpose core is fully utilized and resources for an additional specialized core have been provisioned.
2.9 Specialization across process technologies with real SP cores. Total throughput for different values of f assuming the area and speedup of one example SP core per GP core. Mobile 35W market segment.
2.10 Configurations that can achieve 1.58x/year throughput. Model two different accelerator structures: the programmable CELL SPE and an H.264 accelerator. Core2Duo-based GP cores and the Mobile 35W market assumed.
3.1 Measured and simulated current consumption for the Beacon application. The simulated version includes a breakdown according to radio, LEDs, and CPU current. A lower resolution digital multi-meter was used for the above measurement, which did not capture the very short duration peak power spikes during the wakeups.
3.2 Surge Application Power Consumption Breakdown. 60 sec of the surge TinyOS application run on the Mica2 mote.
3.3 System Block Diagram.
3.4 Event Processor State Machine
3.5 Diagram and Code of the Monitoring Application. The code displayed are ISR routines written for the event processor. Actual address values have been omitted to make the code easy to read.
3.6 Test Circuit Used for Simulations. The circuit consists of an 11 stage ring oscillator made up of an assortment of logic gates. Interconnect was modeled between devices.
3.7 Leakage Power, EDP, and Frequency Across all Technologies. Each line indicates a technology node from 180nm to 70nm. Supply voltage is on the X-axis, which was swept from 0.1V to the max VDD specific to the process. Temperature is 20C and all transistors are minimum size.
3.8 Results for Baseline Architecture. Performance target of N=100 sense and transmit tasks.
3.9 Effect of Energy Reduction Techniques on Total Energy Consumption of the Architecture Across Process Technologies. Power supply voltage is limited to VtP + VtN and the number of tasks per second is 100.
3.10 Summary of Energy Reduction Techniques Across Process Technologies. Each bar represents the minimum energy calculated for a particular architecture configuration and process technology. Both the total energy consumption and a percentage breakdown of the source of energy consumption are included.
4.1 Custom VDD-Gating Circuit. The schematic shows four different parallel legs which are used to control VDD-gating strength. Layout of the filter component shows where the VDD-gating circuit is attached. In this example, the VDD-gating circuit requires an additional area of 3.2%.
4.2 Die Photograph of 130nm Prototype. System includes an event processor and several accelerators for regular operation. The system has been realized in 130nm CMOS on a 2mm x 2mm die. The system contains 444,982 transistors including 4KB of foundry supplied SRAM.
4.3 Frequency versus Voltage Shmoo. Shaded region of plot indicates where the test failed; the unshaded region indicates successful operation. Results from a full run of the sense and transmit application were used to generate a shmoo. Due to limitations of the test board the chip was measured up to 12.5 MHz. The shmoo generated using post layout simulations indicates the chip will work up to 100 MHz.
4.4 Measured power consumption of the prototype under different supply voltages and clock frequencies. Plots a-c show the power consumption for the Event Processor, Accelerator, and SRAM power domains while sweeping voltage from 450 mV to 800 mV and frequency from 25 kHz to 12.5 MHz. Idle power is measured with the external clock off (0MHz @550mV). The VDD-gating transistor is off (not conducting) during the measurement of gated power.
4.5 Energy per Task of Sense and Transmit Task. Application includes all accelerator blocks and power contributions from the SRAM and Event Processor.
4.6 Comparison to Other Systems Designed for WSN.
4.7 Performance and Power Benefits of Specialization. Test routines were executed both on the hardware accelerators and the microcontroller. Cycle count and energy savings are presented.
4.8 Evaluation of Accelerator-based Architecture vs. General Purpose System
4.9 WSN architecture projected to advanced process technologies and power budgets. Die size, f, and S are fixed based on measurements of the original system. Area is swept and the configuration with the maximum throughput is reported for three different power budgets.
A.1 Smart Dust Microarchitecture [59].
A.2 Block Diagram of the Subliminal Processor (University of Michigan) [51].
A.3 Simplified block diagram of the SNAP processor for WSN. System includes separate instruction and data memories, a timer coprocessor, and a message processor which provides a FIFO interface to the off-chip radio and sensors [9].
A.4 The Charm protocol processor microarchitecture [52].

List of Tables
2.1 Predicted Process Technology Characteristics. High-Performance Microprocessor Technology, ITRS 2007 Edition [50].
2.2 Technology Scaling Factors. High-Performance Microprocessor Logic. Indicates a departure from historical scaling trends resulting in an increase in power density. [50]
2.3 Example Cores used in analysis. Data collected from conference and journal publications and datasheets. SPEC2006 results used to determine IPC are from spec.org.
2.4 Market Segment Constraints. Die size and Max Power Consumption for a set of market segments. Values for the first three markets came from ITRS [50]. The final four market segments are based on die size and thermal design point of commercially available Intel Processors.
2.5 Select Microprocessors from 1996 to 2007. Performance data is from the analysis in Figure 1.1. Power consumption and die size data was acquired from datasheets and published microprocessor reports.
2.6 Specialized Cores. Example SP cores used in the model. All measurements were scaled to 65nm technology and speedup was calculated by comparing published performance results to the performance on a general purpose CPU. The Core2 is included to show the relative area and performance cost of including another GP core instead of an SP core. Power and speedup for CELL SPE running Linpack.
3.1 Sensor Sampling Rates of Different Phenomena
3.2 Example WSN application domains.
3.3 Power model for the Mica2. The mote was measured with the micasb sensor board and a 3V power supply.
3.4 Event Processor Instruction Set
3.5 Comparison of cycle count for the test application written on our architecture and on TinyOS for the Mica Platform.
3.6 Scaling Factors. From theory and simulation data.
3.7 Activity Ratios for Our Test Application
B.1 System Bus Signals
B.2 System Memory Map. All addresses are in hex.
B.3 System Interrupt Map. Lists all of the interrupts in the prototype and the source of the interrupt.
B.4 Power Domains in the Prototype. Lists all of the power domains in the prototype including virtual power domains and power domains for testing only.

Citations to Previously Published Work
The architecture presented in Chapter 3 first appeared in the following paper:

    An ultra low power system architecture for sensor network applications, Mark Hempstead, Nikhil Tripathi, Patrick Mauro, Gu-Yeon Wei, and David Brooks, In The 32nd Annual International Symposium on Computer Architecture (ISCA), June 2005.

The PowerTOSSIM simulator, presented in Section 3.1.2 including Figure 3.1, appeared in:

    Simulating the Power Consumption of Large Scale Sensor Network Applications, Victor Shnayder, Mark Hempstead, Bor-Rong Chen, Geoff Werner-Allen, and Matt Welsh, In Proceedings of the Second ACM Conference on Embedded Networked Sensor Systems (SenSys), Baltimore, MD, Nov 2004.

The evaluation of process technology selection, presented in Section 3.5, appeared in:

    Architecture and Circuit Techniques for Low-Throughput, Energy-Constrained Systems Across Technology Generations, Mark Hempstead, Gu-Yeon Wei and David Brooks, In Proceedings of the International Conference On Compilers, Architecture, And Synthesis For Embedded Systems (CASES), Seoul, South Korea, October 2006.

The related work, presented in Section 4.3 and Appendix A, was first surveyed in the following invited paper:

    Survey of hardware systems for wireless sensor networks, Mark Hempstead, Michael J. Lyons, David Brooks and Gu-Yeon Wei. ASP Journal of Low Power Electronics, Vol. 4, No. 1, April 2008.

The Navigo model presented in Chapter 2 is currently under submission in the following paper:

    Navigo: A Model to Study Power-Constrained Architectures and Specialization, Mark Hempstead, Gu-Yeon Wei, and David Brooks [Under Submission]

The measurement results of our prototype, presented in Chapter 4, are currently under submission:

    An accelerator-based wireless sensor network processor in 130nm CMOS, Mark Hempstead, David Brooks, and Gu-Yeon Wei, [In preparation]

Acknowledgments
The path to this PhD has been an adventure, and I would like to take this op-
portunity to thank all of those who have helped and supported me along the way.
Throughout my journey the path was often hard to find and, without the guidance and
encouragement from these individuals, I would have never overcome the academic,
technical, and emotional challenges that blocked my way.
First, I would like to thank my advisers Gu-Yeon Wei and David Brooks for taking
a chance on me to start a fruitful collaboration across the disciplines of circuit design
and architecture. Throughout the last few years they have supported and guided my
transformation as a researcher. I appreciate the endless hours they spent providing
feedback on talks, papers, and chips, pushing me to think more deeply. Early in
my research career I received valuable feedback from my qualification committee,
Woodward Yang and Paul Horowitz. I am grateful to Margo Seltzer for her instruction
in paper writing and presentations in CS261 and, more recently, for agreeing to serve
on my dissertation committee.
Throughout the duration of my research project, several individuals helped me
with architecture exploration and early Verilog coding, including: Nikhil Tripathi,
Patrick Mauro, and Xiaoyao Liang. Michael Lyons and I have enjoyed a strong collaboration
brainstorming the design of SMASH, a next-generation architecture. I wish
to thank the other members of the Mixed-signal VLSI and Architecture groups: Am-
ber Tan, Ruwan Ratnayake, Andrew Liu, Hayun Chung, Ankur Agrawal, Wonyoung
Kim, Durlov Khan, Meta Gupta, Benjamin Lee, VJ Reddi, and Kevin Brownell.
They provided invaluable instruction and support when I was met with problems
using CAD tools, test equipment, and architecture simulators. Moreover, they were
the source of supportive conversations at lunch, over dinner and during late night
tape-outs.
Halfway through my grad student career, our group received the gift of Glenn
Holloway, whose management of our machines and debugging support at all hours
saved me weeks of frustration. Jim MacArthur in the Cruft circuits lab was an
invaluable resource when I needed help designing PCBs, soldering, or finding random
parts. Because my research crossed into the systems realm, early collaborations with
the wireless sensor network (WSN) group, including Matt Welsh, Geoffrey Werner
Challen, Victor Shnayder, and Bor-Rong Chen, helped me understand the needs
of the WSN community. I'm thankful to UMC and the SRC for supporting the
fabrication of my two test chips. I would like to thank Joel Emer, Mark Charney, and
Geoffrey Loweny for hosting me at Intel in Hudson, MA for a summer and exposing
me to research in higher performance systems.
For me, grad school was more than just research: I had the opportunity to engage
in a diverse set of activities, from teaching to graduate student organization
and the Harvard house system. Harry Lewis introduced me to his unique course,
QR48:BITS, and he was a wonderful teaching mentor who gave me the chance to try
my hand at lecturing. Likewise, Woodward Yang showed me how to coach students
in engineering design in ES96. I'm thankful to Hwa Chang and Jeffery Hopwood at
Tufts for mentoring me after I took over the digital logic class this semester. I would
like to encourage the students who have taken over the graduate student life commit-
tee to continue the good work of building a community within SEAS and motivating
graduate students to leave their labs occasionally. For the past three years, my fellow
tutors, masters, and students have made Lowell House into a vibrant and supportive
home.
Throughout my graduate school experience, it was the support of my caring friends
and family that kept me going. Specifically, I would like to thank my parents, David
and Rolande, who brought me up with such caring and supported me with a smile
when I turned down a job in the real world for graduate school. My father, who
taught me to think like an engineer at a young age through his probing questions at
the dinner table, continues to challenge me today. My mother, who rightly believes I
need emotional support just as much as technical support, continues to pick me back
up after each paper rejection. My sister Amy was my lifeline here in Boston over the
past few years. Though she has suppressed her engineering genes, she continues to
surprise me with a display of her scientific mind over a bottle of wine. My brother
Chris, the more practical engineer, taught me how to put a square peg in a round
hole with a big hammer. His thoughtfulness and ingenuity just might convince
me to start a company with him ... someday. Finally, I cannot give enough thanks
to Megan, whose caring, kindness, and support over the last few years made this
dissertation possible and easier to read. I look forward to many more adventures
together and one more dissertation between us.

Dedicated to those who have paved the way for me:
my parents David and Rolande,
and my grandparents David and Margaret Hempstead, and Rudy and
Lillian Perreault.

Chapter 1
Introduction and Summary
Contents
    1.1 Motivation
        1.1.1 Technology and Trends
        1.1.2 Market Requirements
    1.2 Holistic Approach
    1.3 Accelerator-based Architectures
    1.4 Summary of Contributions
Advances in computational capabilities have driven the information technology
revolution, which in turn has driven advances in nearly all fields of science, medicine,
and business. Although incredibly powerful computing devices are available today,
this single-minded pursuit of performance has made power consumption one of the
main bottlenecks for nearly all types of computing systems, from high-end servers
to wireless sensor devices. Due to limitations in device cooling at the high-end and
battery technology at the low-end, processor designs are increasingly stratified into
power-constrained market segments in which the challenge is to increase processor
performance for a fixed power budget. While advanced fabrication technologies are
projected to continue to provide computer designers a doubling of transistors per
generation, slowing constant-field scaling and worsening wire parasitics mean that the
energy per switching event will scale at a rate at which chip power essentially remains
constant for a fixed clock frequency and core activity. Current trends towards
large multi-core systems utilize the additional transistor bounty for additional power-
efficient cores but, with single-thread performance saturated, most benefits will come
through thread-level parallelism. Assuming an optimistic scenario for the continued
extraction of thread-level parallelism from workloads, chip performance gains will
track growth in transistor counts. The International Technology Roadmap for Semiconductors
(ITRS) projects a doubling in the number of transistors every three years
(i.e., about 1.25x per year), leading to an increasing gap between projected performance
growth and historical performance growth rates. Bridging this performance gap will
require an architectural paradigm shift to augment the multi-core trend, in which
an increasing fraction of chip real estate must be devoted to specialized logic that
provides significant benefits in performance per switching event for a growing portion
of workloads.
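As a rough illustration of the gap described above, the following sketch compares the two compound growth rates quoted in the text: the historical 1.58x per year and the rate implied by a doubling of transistors every three years (2^(1/3), about 1.26x, which the text rounds to 1.25x). The function and constant names are our own and purely illustrative, and the calculation takes the optimistic assumption that performance tracks transistor count exactly. It is a back-of-the-envelope check, not part of the Navigo model.

```python
# Illustrative arithmetic only. The two rates come from the text; the names
# and the chosen horizons are assumptions made for this sketch.
HISTORICAL_RATE = 1.58       # historical performance gain per year
DOUBLING_PERIOD_YEARS = 3    # ITRS: transistor count doubles every three years


def annual_rate_from_doubling(period_years: float) -> float:
    """Annual growth factor implied by doubling every `period_years` years."""
    return 2.0 ** (1.0 / period_years)


def performance_gap(years: int, fast: float, slow: float) -> float:
    """Ratio between two compound growth curves after `years` years."""
    return (fast / slow) ** years


if __name__ == "__main__":
    transistor_rate = annual_rate_from_doubling(DOUBLING_PERIOD_YEARS)
    print(f"transistor-limited growth: {transistor_rate:.2f}x per year")
    for years in (3, 6, 9):
        gap = performance_gap(years, HISTORICAL_RATE, transistor_rate)
        print(f"after {years} years the historical trend leads by {gap:.1f}x")
```

Because both curves compound, the ratio between them grows geometrically, which is why even a modest difference in annual rates opens a large gap within a few technology generations.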
This dissertation argues that maintaining growth in system performance requires
using transistors more efficiently to achieve higher performance per watt. The power
consumption of a computing device depends on all layers of the design space, from
application software and system architecture to circuits and process technology.
This work takes a holistic approach by developing models and designs incorporating
all layers of the design space. In this chapter, we describe the technology and market
conditions that motivate this work and our holistic approach. We also describe
accelerator-based architectures in general and allude to a prototype that we designed
and taped out for this work. Finally, we summarize the main contributions of this
work.

[Figure 1.1 plot: CPU performance (log scale, 10^2 to 10^6) versus year (1985 to 2020), showing the historical 1.58x trend and, in the power-constrained era, multi-core and single-thread performance predictions.]
Figure 1.1: Growth in Microprocessor Performance. Historically the industry has observed a total 1.58x performance gain per year. Power consumption constraints inhibit performance growth, causing a gap between expected and delivered performance. Data from Hennessy and Patterson [25] and spec.org [54].
1.1 Motivation
1.1.1 Technology and Trends
Over the past few decades the performance of microprocessors has grown steadily.
However, over the past several years designers have been forced to slow the growth
of single-thread performance because of increasing power consumption. To explore
these trends, Figure 1.1 plots both historical performance growth and projected multi-core
and single-threaded performance growth until 2020. All data in the plot is
relative to the VAX 11/780 as measured by SPECint benchmarks. Data in the plot
before 2005 was obtained from Hennessy and Patterson, and data for recent years
was obtained using the highest single-die results for SPECint2006 (single-thread)
and SPECint2006rate (multi-core) from the SPEC website [25, 54]. Performance
growth began to deviate from the historical 1.58x per year trend in 2001, primarily
due to the difficulty of obtaining additional clock frequency and instruction-level
parallelism improvements in the face of power constraints. The computing industry
has reacted to this trend by concentrating on multi-core designs that capture thread-level
parallelism. Unfortunately, as detailed in this work, power issues will prevent
multi-core performance growth from meeting the historical trend, and closing this
gap will require more efficient use of transistors.
1.1.2 Market Requirements
The growth of the semiconductor industry has not only been driven by perfor-
mance gains but also by a growing diversity of applications for microprocessors. Mi-
croprocessors have moved out of government and corporate computing centers into
homes, schools, coffee shops, and, now, pockets and pocketbooks. As microprocessors
have found additional uses beyond high-performance and desktop computing,
new design constraints are being applied to microprocessors, among them power,
size, and cost.

Power consumption is increasingly the primary design constraint for mobile and
embedded devices, as designers try to maximize battery life and reduce cooling cost.
The performance and power consumption requirements across market segments vary
by several orders of magnitude: high-performance servers have a power limit of
200 W, while some processors for laptops and netbooks are designed to consume a
maximum of 1-10 W (Chapter 2 includes a more detailed list of market segments and
power constraints). The power constraints imposed by the market are at odds with
the increase in power density caused by technology scaling. Because mobile and
embedded devices are untethered from the power grid, power consumption has been
a concern within these communities for some time.
The emerging market segment of wireless sensor networks (WSNs) places even
more stringent power constraints on processor design and therefore is an indicator
of what is to come for the other market segments in the future. Wireless sensor
networks have applications in medicine, science, industrial automation and security.
WSN nodes are often deeply embedded in an environment and decoupled from the
wired power grid. Consequently, designers would like to use scavenged energy to power
WSN devices indefinitely. Currently available energy scavenging methods place a
power consumption constraint of roughly 100 µW on microprocessors designed for
environmentally powered WSNs (a more detailed background of WSNs and energy
scavenging is presented in Section 3.1). These strict limits on power consumption
provide increased design pressure to maximize performance-per-watt. As technology
scales and power density increases, other market segments will face similar design
challenges.

[Figure 1.2, two panels: (a) Holistic Approach and (b) Research Cycle. Diagram labels:
Application, Network, Architecture, Circuits, Process Tech; Modeling (Power +
Performance), Design (Architecture/Circuits), Circuit Simulations, Prototyping.]
Figure 1.2: Research Approach. We take a holistic approach to research,
understanding and addressing power consumption at all layers of the design space.
Architecture innovations are informed by modeling and prototyping.
This work investigates the impact of technology scaling on power consumption. As
this section has described, the pressures of a power-constrained era require designers
to think about improving performance per watt by using transistors more efficiently.
This work takes a holistic approach looking at all areas of the design space, using the
emerging domain of WSNs as a case study in ultra-low power design.
1.2 Holistic Approach
During the course of our research, we have taken the view that all layers of the
design space influence power consumption, from the application and network to the
architecture and circuits. Figure 1.2 provides a graphical description of the research
approach we employed. Our research efforts follow an iterative approach through
modeling, design, and prototyping, and our models incorporate inputs from a variety
of design layers. For example, the PowerTOSSIM model (Section 3.1.2) accepts inputs
from the network and application layers and physical power measurements of nodes,
while the Navigo model (Chapter 2) takes data from circuit simulations, process
technology data, and performance benchmarks of different architectures.
We use modeling to guide design decisions, which are verified by circuit simulations
and prototyping. Chapter 3 describes a design motivated by the modeling of appli-
cation behavior and addresses leakage current, which is increasing due to technology
scaling. Because our power consumption targets are so low, we developed a prototype
in 130nm CMOS to verify that our design achieves ultra low power operation. Both
the power and performance measurements of the prototype, presented in Chapter 4,
prompt more analysis and modeling of generalized accelerator-based architectures.
Consequently, results from our prototype and modeling efforts will drive our future
research efforts.
1.3 Accelerator-based Architectures
Both the trends in technology and market pressures to increase power efficiency
reveal the need to extract more computation from each transistor switch. Many
designers intuitively believe that application-specific integrated circuits (ASICs)
provide higher performance and increased energy efficiency over general-purpose
designs. However, ASICs are tuned for a particular set of computations and hence do
not possess the flexibility and programmability of a general-purpose processor. One
approach, used by the system-on-chip community, places ASIC accelerators on a chip
with a general-purpose microcontroller. As we show in this work, an accelerator-based
approach has the potential to compensate for the loss of performance due to power
constraints. We show that maximizing total system performance requires that the
accelerators provide application speedup (S) for a large fraction of the workload (f).
The regular nature of computation and the ultra-low power requirements of the
WSN application domain make it well-suited to benefit from an accelerator-based
architecture. As a case study of accelerator-based architectures, we designed and
implemented a processor for WSN applications. Our implementation utilizes the
modular nature of the architecture to turn off unused accelerators and address leakage
current with architecture. We also do away with the notion that the system needs
to be controlled by a high-powered general-purpose core and, instead, we replace it
with an event-driven state machine. Traditionally, the energy efficiency of a system
has been evaluated through the metric of energy-per-instruction. The concept of an
instruction does not map onto accelerator-based architectures and, therefore, we
propose several new methods to analyze the efficacy of our prototype.
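The dependence of system performance on S and f can be illustrated with the classic form of Amdahl's law. The sketch below is only that classic form, not the enhanced variant developed in Chapter 2, and the function name is chosen here for illustration.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Classic Amdahl's law: overall speedup when a fraction f of the
    workload is accelerated by a factor s and the rest is unchanged."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


if __name__ == "__main__":
    # Sweep a few illustrative (f, S) pairs.
    for f in (0.5, 0.9, 0.99):
        for s in (10.0, 100.0):
            print(f"f={f:.2f}  S={s:5.0f}  overall={amdahl_speedup(f, s):6.2f}x")
```

Even a large S yields little overall benefit when f is small, which is why both parameters matter.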
1.4 Summary of Contributions
This work presents the combined contributions of four different modeling and
analysis frameworks and a ground-up silicon implementation of a processor for
wireless sensor networks. Following the research approach presented in Section 1.2, the
modeling frameworks are informed by several layers of the design space: applications,
architecture, circuits, and process technology. The Navigo model, presented
in Chapter 2, accepts libraries that describe architecture features, process technology
characteristics, and voltage and frequency relationships from circuit simulations.
Through the analysis of the inputs, Navigo reports an estimate of performance and
power consumption for future generations of microprocessors. The results reveal that
power consumption increasingly limits performance. Subsequent analysis with Navigo
shows that specialization can provide the necessary performance-per-watt. However, the
high-level analysis from Navigo needed to be grounded in a real implementation to
understand the benefits and costs of accelerator-based architectures. The design of
our prototype was informed by our modeling of wireless sensor network applications
with PowerTOSSIM, presented in Section 3.1.2, and an understanding of process
technology trends. Likewise, the architecture of the prototype drives the analysis in
Chapter 4 and the process technology study in Section 3.5. Through the models and
prototype, this work presents the following insights and major contributions.
Navigo: A Model to Study Power-Constrained Architectures and Specialization (Chapter 2)
Modeling Framework for Early Exploration. Currently, designers use intuition
and spreadsheet-based models to explore design decisions and estimate the power
consumption and performance of architectures five to fifteen years away from
tape-out. Navigo provides features not available in spreadsheet-based models,
including voltage-frequency scaling to meet power constraints and input from
circuit simulations. By incorporating different architecture models, Navigo can
be used to model massive multi-core designs.
Amdahl's Law for Specialization. We enhanced Amdahl's law to model heterogeneous
accelerators that can provide a speedup (S) for a fraction of applications (f).
By including the enhanced Amdahl's law and architecture models of specialized
accelerators, Navigo can be used to compare homogeneous multi-core designs with
designs that include specialized accelerators.
Results Show Increasing Effect of Power Constraints. Results using Navigo
reveal that the performance of multi-core systems will be significantly reduced due
to power constraints. While some designers intuitively understand this result,
our work is one of the first quantitative presentations of this issue. This result
should serve as a call to action to develop systems with a higher
performance-per-watt.
Analysis for Amount of Specialization. By including specialized accelerators
in the model, we use Navigo to select the amount of specialization (both S and
f) required to maintain the performance growth shown in the semiconductor
industry. This analysis gives designers the target amount of area to allocate to
specialization in designs over the next decade.
Accelerator-Based Architecture for Wireless Sensor Networks (Chapter 3)
Holistic Design Informed Through Application and Circuits. We built
PowerTOSSIM to study the power consumption of WSN applications. We used
insights gained from PowerTOSSIM to guide our design of the system architecture.
Accelerator-Based Event-Driven Architecture. The custom architecture for
WSNs includes hardware accelerators for regular tasks, offloads event processing
to a custom hardware component (the Event Processor), and addresses
leakage power with architecture support for VDD-gating.
Performance Improvements over Mica2. A SystemC model of the architecture
shows a 10x performance improvement over the Mica2 architecture for typical
WSN tasks.
Framework for Process Technology Selection. We built a framework to evaluate
the selection of process technology. We based the framework on a Verilog
model of the architecture and circuit simulations of different process technology
generations. The results show that, because of increasing leakage current, the
most advanced process technology node is not the best choice to minimize total
system power consumption.
Silicon Implementation and Evaluation of Accelerator-Based Systems (Chapter 4)
Prototype Chip in 130nm CMOS. We built a prototype as a case study of
accelerator-based architectures. It incorporates synthesized accelerator blocks,
a custom VDD-gating circuit, and 2 KB of SRAM for a total of 444,982 transistors.
Functional Verification and Per-Block Power Measurements. We verified the
prototype for functionality, and it functions correctly up to 12.5 MHz at 550
mV. Post-layout simulations estimate that the system could run up to 100 MHz
at 1.2V. Measurements of per-block power show that VDD-gating reduces idle
leakage power by up to 100x.
New Metric of Energy per Task and Comparison to Related Work. The
traditional metric of energy-per-instruction does not accurately measure an
accelerator-based architecture. Therefore, we introduce two new metrics, Energy
per Task and Energy per Equivalent Instruction, to compare the prototype
to related work. With a measured energy per task of 678.9 pJ and energy per
equivalent instruction of 0.44 pJ, this system is the lowest energy processor
currently available for WSNs.
Analysis of Accelerator Speedup and Energy Savings. We isolate the benefits
of accelerator-based computing by comparing hardware and software implementations
of the routines expressed by the accelerators. The results show a 15x to
635x performance speedup and a 10x to 600x energy savings, depending on the
routine.
Comparison to General-Purpose Designs through Workload Analysis with Voltage
and Frequency Scaling (VFS). We compare our system against a general-purpose
design while sweeping workload intensity. Voltage and frequency scaling
and VDD-gating are included in the analysis. The results show that the
architecture is well-suited for low duty cycle applications and at the same time
can provide more performance for high-intensity workloads than general-purpose
designs.
This work provides both a high-level justification for accelerator-based architec-
tures and a case study built from the ground up. The work concludes with a discus-
sion of some of the open research questions in this area and a description of current
research efforts.

Chapter 2
Navigo: A Model to Study
Power-Constrained Architectures
and Specialization
Contents
2.1 Navigo: A Model for Performance Trends in Future Technologies
2.1.1 Modeling Methodology and Sample Libraries
2.2 Power-constrained Performance for Multi-core
2.2.1 Results without Power Constraints
2.2.2 Results with Power Constraints
2.3 Validating the Model
2.4 Modeling Specialization
2.4.1 Variant of Amdahl's Law for Specialization
2.4.2 Examples of Specialized Cores
2.5 Model Limitations and Future Directions
Given the technology scaling trends and market requirements presented in Sec-
tion 1.1, it is important for chip architects to understand the limitations of homoge-
neous parallelism and to consider more radical architectural approaches. This chapter
presents Navigo, a model that incorporates technology scaling effects to predict future
power-constrained performance trends. Navigo can be used to predict, for a variety
of processor cores, circuit parameters, and market segments, performance trends and
shortfalls from the historical growth rate. Future designs that seek to bridge this gap
must more effectively utilize switching events through specialized hardware. Special-
ization hardware can take many forms [11, 29, 36, 38] including programmable SIMD
units, hardcoded ASIC cores, or reconfigurable logic, and Navigo includes a general
analytical model that can capture the impact of parallel specialization on power-
constrained performance gains. This model projects the amount of specialization,
quantified in terms of several parameters, that will be required in future technology
generations to meet the historical performance scaling trends.
In addressing the problem of power-constrained performance scalability, the chap-
ter makes the following contributions:
We describe Navigo (Section 2.1), a model incorporating technology scaling,
circuit design parameters, and architectural design decisions into a high-level
model to facilitate understanding the impact of power-constrained performance.
We use Navigo to understand a large design space of input parameters (Sec-
tion 2.2).
We extend Navigo to model parallelizable specialization hardware (Section 2.4),
introducing additional parameters to quantify specialization benefits and power/area
costs. This model demonstrates that in order to maintain historical performance
growth, we must increase the amount of specialization for each technology
generation.
2.1 Navigo: A Model for Performance Trends in
Future Technologies
Trends in process technology scaling, predicted by the International Technology
Roadmap for Semiconductors (ITRS), consider a variety of factors that affect the
performance scalability of future computing systems. Designers can no longer rely on
the next technology node to increase circuit performance and reduce energy consump-
tion. Constant-field scaling (or Dennard scaling [64]) has run out with limits imposed
on how aggressively one can reduce transistor threshold voltages (Vth) and supply
voltage (Vdd). The dramatic increase in leakage current has effectively flattened out
Vth scaling for planar CMOS technologies such that supply voltage scaling has also
slowed down. While technology continues to reduce transistor size, wire parasitics
are getting worse after a short respite gained by moving to copper. Lastly, the power
ceiling imposed by cooling costs and battery life further limits performance gains
traditionally offered by technology scaling. In short, the landscape of processor design
has changed dramatically since the end of the twentieth century. It is imperative to
arm future designers with tools that can navigate through the complex interactions of
future process technology scaling trends on architectural and circuit design choices,
coupled with power budget limitations imposed across different market segments. To
this end, we present Navigo, a detailed model that incorporates the effects of process
technology, circuits, architecture, and market to predict future processor performance
trends.

[Figure 2.1 diagram: library inputs for Process Technology (ITRS), Circuits (HSPICE),
Architecture (general purpose and specialized cores), and Market constraints (Server,
Desktop, Mobile, WSN, etc.) feed Navigo, together with user-defined inputs (technology
node, Vdd (nominal, min), frequency, number of cores and type, market selection).
Outputs: throughput and power.]
Figure 2.1: Graphical depiction of Navigo. The model accepts library files for
process technology, circuits, architecture, and market segments, and computes total and
constrained power for a set of user-defined inputs such as supply voltage, frequency,
etc.
This section begins with a high-level overview of Navigo, which outlines the basic
goals and assumptions made. Then, it describes the inner details of the model,
revealing how it can be used by designers in early stages of design to help guide
high-level system and architectural design decisions.
Navigo provides designers with a powerful and flexible tool to navigate the
intricate tradeoffs between process technology, circuits, and architecture, in order to
predict their implications on performance in future processor designs. Figure 2.1
presents a high-level graphical representation of Navigo. The model takes in a
variety of input libraries, which quantify detailed parameters corresponding to process
technology, circuit performance, architecture, and market segment constraints. While
each of these libraries can be modified by the user, Navigo includes built-in libraries
based on ITRS technology scaling predictions out to 11nm (anticipated in 2022),
predictive technology models (PTM) [47, 67], IPCs of currently available processor cores
(based on SPECint2006 scores), and high-level power and area constraints for differ-
ent market segments. With the libraries in place, the designer can sweep a variety
of input parameters such as technology node, voltage, frequency, target market, etc.
Navigo then outputs the total system throughput and power. The user can then
refine her design by iterating through different input parameters to meet a specific
throughput and/or power target.
2.1.1 Modeling Methodology and Sample Libraries
An engine that takes the various libraries and input sweep parameters to calcu-
late throughput and power consumption is at the core of Navigo. This engine must
consider a variety of factors such as the number and characteristics of computational
blocks (i.e. cores), voltage and frequency scaling, wire loading, leakage power, and
process technology, all constrained by power budget limitations. All of these factors
are quantified by the different library parameters.
The process technology library quantifies several parameters and characteristics
utilized by Navigo, which are listed in Table 2.1. These parameters set the basic
device and wire characteristics that Navigo uses to determine circuit speed, power,
and the number of cores that will be available in future technology nodes. The
built-in process technology library uses published data from ITRS 2007 [50] out to the
11nm technology node anticipated in year 2022. ITRS predicts double gate technology
will supplant planar bulk devices at the 32nm node in year 2013. Because ITRS is a
predictive roadmap based on current projections of technology, actual products can
deviate from it: the semiconductor industry has a history of either under- or
out-performing ITRS. For example, Intel's technology roadmap is more aggressive, with
processors at the 45nm node already shipping and plans to introduce processors on the
32nm node in late 2009. Hence, this library can be readily modified by the user to
better reflect updated ITRS projections or proprietary information if available.
Table 2.2 compares technology trends up to 1999 described by Borkar [2] to ITRS 2007
predictions, which reveals a divergence in power density. This departure from
traditional constant-field scaling affects frequency and voltage scaling in future
designs, which we thoroughly explore in Section 2.2. Throughout the rest of this
chapter, we rely on technology predictions made by ITRS 2007.

Year of Production                 2007     2010     2013     2016     2019     2022
Device type                        Planar   Planar   Double   Double   Double   Double
                                   Bulk     Bulk     Gate     Gate     Gate     Gate
Approximate node (nm)              65       45       32       22       16       11
Supply Voltage (V)                 1.1      1.0      0.9      0.8      0.7      0.65
Physical Gate Length (nm)          25       18       13       9        6.3      4.5
Id sat (uA/um)                     1211     1807     2204     2627     2768     2786
Intrinsic delay (ps)               0.64     0.46     0.26     0.15     0.1      0.08
Intrinsic switching energy (fJ)    0.064    0.045    0.020    0.0085   0.0037   0.0020
RC delay of 1mm wire (ps)          890      2100     4555     10652    23515    58525
Die Size-Server (mm2)              310      310      310      310      310      310
Number of Transistors (M)          1106     2212     4424     8848     17696    35391
Table 2.1: Predicted Process Technology Characteristics. High-Performance
Microprocessor Technology ITRS 2007 Edition [50].

Source                   Transistor   Energy per   Active   Area   Power
                         Delay        Switch       Power           Density
Borkar99 [2]             0.70         0.34         0.49     0.49   1.00
ITRS07 (average) [50]    0.67         0.51         0.76     0.50   1.53
Table 2.2: Technology Scaling Factors. High-Performance Microprocessor Logic.
Indicates a departure from historical scaling trends resulting in an increase in power
density. [50]
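The relationship among the columns of Table 2.2 can be checked with a few lines of arithmetic. The sketch below assumes that active power scales as energy per switch divided by transistor delay (that is, frequency scales as the inverse of delay) and that power density is active power divided by area. These relations and the class name are assumptions made for illustration; small differences from the table come from rounding of the published factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingFactors:
    """Per-generation scaling factors transcribed from Table 2.2."""
    delay: float              # transistor delay
    energy_per_switch: float
    area: float

    @property
    def active_power(self) -> float:
        # Assumption: active power ~ energy per switch x frequency,
        # with frequency scaling as 1 / delay.
        return self.energy_per_switch / self.delay

    @property
    def power_density(self) -> float:
        # Assumption: power density ~ active power / area.
        return self.active_power / self.area


BORKAR99 = ScalingFactors(delay=0.70, energy_per_switch=0.34, area=0.49)
ITRS07 = ScalingFactors(delay=0.67, energy_per_switch=0.51, area=0.50)

if __name__ == "__main__":
    for name, sf in (("Borkar99", BORKAR99), ("ITRS07", ITRS07)):
        print(f"{name}: active power {sf.active_power:.2f}, "
              f"power density {sf.power_density:.2f}")
```

Under these assumptions the historical factors give a power density near 1.0 per generation, while the ITRS 2007 factors give a value near 1.5.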
The circuits library utilizes predictive technology models (PTM) [47, 67], available
from the 45nm node down to 16nm, to model how power and frequency scale with
supply voltage and different amounts of wire parasitics. In the absence of detailed cir-
cuit blocks that can be simulated, we rely on HSPICE simulations of fanout-of-4 ring
oscillators across the technologies to determine basic frequency, power, and voltage
trends. We combine ITRS predictions with PTM-based simulations to extrapolate
trends at the 11nm node. These trends allow Navigo to scale voltage and frequency to
meet different power budgets. It is also important to consider the effects of imposing
minimum voltage (VddMIN) constraints, since allowing arbitrary reductions in supply
voltage can lead to a variety of issues, including six-transistor SRAM cell instability
[65] and exacerbation of on-chip voltage noise. Again, the circuits library can
be modified by the user to model specific blocks if available.
The architecture library contains a collection of processor cores that the user can
choose to tile together in future multi-core systems. The built-in architecture
library consists of three cores currently in production, listed in Table 2.3. These cores,
Intel Xeon (Netburst), Intel Core2Duo (Core), and Intel Atom, represent high-end
server, desktop, and mobile CPUs. We plan to include analysis for processors such
as Intel's Core i7, as detailed information becomes available. Parameters for the
processors were obtained from publications, and SPEC scores for Xeon and Core2Duo
were obtained from spec.org. Since official SPEC results are not available for Atom, we
extrapolate based on benchmark comparisons between Atom and an Athlon with known
SPEC scores [57]. While the processors have been implemented in different
technologies, the power, performance, and area of each core are appropriately scaled by
Navigo utilizing the process technology and circuits trends prescribed by their
respective libraries. The user is not constrained to these cores and can also include
user-defined cores in the architecture library. For example, Section 2.4 explores the
impact of specialized cores.

Processor                    Tech   Die Size   Cores   Vdd    Freq    Power   IPC
                             (nm)   (mm2)              (V)    (GHz)   (W)     (SPEC06/GHz)
Intel Xeon (Tulsa) [18]      65     435        2       1.25   3.4     110     3.72
Intel Core2Duo (Wolfdale)    45     107        2       1.36   3       65      6.82
Intel Atom [12]              45     25         1       1.0    2.0     2.0     2.35
Table 2.3: Example Cores used in analysis. Data collected from conference and
journal publications and datasheets. SPEC2006 results used to determine IPC are
from spec.org.
The market segment library identifies different market segment targets that
constrain total area and maximum power. Table 2.4 lists examples of different market
segments. Throughout the rest of the chapter, we focus on two particular market
segments: server and mobile. The server market allows for a maximum area of
310mm2 and maximum power of 198W as defined by ITRS. In contrast, the mobile
market allows for a maximum area of 100mm2 and maximum power of 35W. Again,
different market segments and/or constraints can be easily defined by the user via
changes to the library.

Market                                   Max Power (W)   Die Area (mm2)
MPU-CP Cost and Performance              151             140
MPU-HP High Performance                  198             310
MPU-PCC Power Cost and Connectivity      3               70
Desktop-95                               95              100
Desktop-65                               65              100
Mobile Standard Voltage                  35              100
Mobile Ultra-low Voltage                 10              100
Table 2.4: Market Segment Constraints. Die size and Max Power Consumption
for a set of market segments. Values for the first three markets came from ITRS [50].
The final four market segments are based on die size and thermal design point of
commercially available Intel Processors.
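A market segment library of this kind can be pictured as a small lookup table. The sketch below transcribes the values of Table 2.4 into a Python dictionary; the names `MarketSegment`, `MARKET_SEGMENTS`, and `power_density_limit` are hypothetical and do not reflect Navigo's actual library format, which is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketSegment:
    max_power_w: float    # maximum power budget (W)
    die_area_mm2: float   # die area budget (mm^2)


# Values transcribed from Table 2.4.
MARKET_SEGMENTS = {
    "MPU-CP": MarketSegment(151, 140),
    "MPU-HP": MarketSegment(198, 310),
    "MPU-PCC": MarketSegment(3, 70),
    "Desktop-95": MarketSegment(95, 100),
    "Desktop-65": MarketSegment(65, 100),
    "Mobile Standard Voltage": MarketSegment(35, 100),
    "Mobile Ultra-low Voltage": MarketSegment(10, 100),
}


def power_density_limit(segment_name: str) -> float:
    """Power budget per unit die area (W/mm^2) implied by a segment."""
    seg = MARKET_SEGMENTS[segment_name]
    return seg.max_power_w / seg.die_area_mm2


if __name__ == "__main__":
    for name in ("MPU-HP", "Mobile Standard Voltage"):
        print(f"{name}: {power_density_limit(name):.2f} W/mm^2")
```

A user-defined segment would simply be another entry in the dictionary.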
Finally, Navigo's engine computes total throughput as follows:
\[ \mathit{Throughput} = N_{cores} \times \mathit{freq}(V_{dd}, \mathit{tech}) \times \mathit{IPC}_{core} \qquad (2.1) \]
where the number of cores, $N_{cores}$, is defined by the total die size (for a target
market segment) divided by the area of the chosen core, scaled by technology node. The
IPC of each core can be derived from published (or simulated for new cores) SPEC
benchmark results and clock frequency of the core. Operating frequency depends both on
process technology and voltage, and is calculated based on the original frequency
published for the core. First, Navigo calculates the maximum frequency of the core for
nominal voltage in the new technology. We incorporate both the intrinsic switching delay
of the transistor and effects due to wire delay scaling. We scale logic and wires
independently because the projected trends follow competing directions and are modeled
separately in ITRS.
\[ \mathit{freq}_{VddNom} = \mathit{freq}_{core,basetech} \times \left( \mathit{frac}_{logic} \, \frac{\mathit{freq}_{switch,tech}}{\mathit{freq}_{switch,basetech}} + \mathit{frac}_{wire} \, \frac{\mathit{freq}_{wire,tech}}{\mathit{freq}_{wire,basetech}} \right) \qquad (2.2) \]
where $\mathit{basetech}$ is the original technology in which the core was fabricated.
The nominal frequency is then multiplied by PTM-based scaling factors to calculate
voltage-specific frequencies.
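The throughput calculation can be sketched in a few functions. This is an illustrative reading of Equations 2.1 and 2.2, not Navigo's code: the function names are hypothetical, core area is assumed to halve with each technology generation, per-core area is assumed to be the die size divided evenly among the cores, and the frequency fractions and ratios in the usage example are placeholders.

```python
def core_count(die_area_mm2: float, core_area_mm2: float,
               generations: int, area_scale: float = 0.5) -> float:
    """Cores that fit on the die after `generations` technology steps.

    Assumes core area shrinks by `area_scale` per generation. The result
    is kept fractional rather than rounded (a simplification).
    """
    return die_area_mm2 / (core_area_mm2 * area_scale ** generations)


def nominal_frequency(freq_base_ghz: float, frac_logic: float,
                      switch_ratio: float, frac_wire: float,
                      wire_ratio: float) -> float:
    """One reading of Equation 2.2: weight the logic and wire frequency
    ratios (new technology over base technology) by their fractions."""
    return freq_base_ghz * (frac_logic * switch_ratio + frac_wire * wire_ratio)


def throughput(n_cores: float, freq_ghz: float, ipc_per_ghz: float) -> float:
    """Equation 2.1, with IPC expressed as SPEC06 score per GHz (Table 2.3)."""
    return n_cores * freq_ghz * ipc_per_ghz


if __name__ == "__main__":
    # Core2Duo from Table 2.3: 107 mm^2 die, 2 cores, 3 GHz, 6.82 SPEC06/GHz.
    core_area = 107.0 / 2  # assumption: die area split evenly between cores
    for node, gens in (("45nm", 0), ("11nm", 4)):
        n = core_count(310.0, core_area, gens)
        print(f"{node}: about {n:.1f} cores on a 310 mm^2 die")
    # Placeholder fractions and ratios (no change in frequency).
    f_nom = nominal_frequency(3.0, 0.8, 1.0, 0.2, 1.0)
    print(f"throughput of the original 2-core part: {throughput(2, f_nom, 6.82):.2f}")
```

Under these assumptions the Core2Duo core count on a 310 mm2 die comes out near 6 at 45nm and near 93 at 11nm.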
Power depends on voltage, operating frequency, and the transistor switching rate
of the architecture. We model average power with the following expression:
\[ P_{avg} = P_{active} + P_{leak} = \mathit{freq} \times (E_{switch} \times N_{switching} + E_{wire}) + P_{leak} \qquad (2.3) \]
Traditionally, power consumption is modeled as a sum of active power and leakage
power. Navigo computes active power as the number of transistor
switches per second multiplied by the energy per switch. We calculate switching rate
($N_{switching}$) from published frequency and power numbers. Since energy per switch
($E_{switch}$) is technology dependent, it scales based on voltage-dependent scaling factors
derived from HSPICE simulations for each technology node. Wires scale differently
from transistors and, hence, are separately accounted for. We assume leakage power
remains a fixed percentage of the total power consumption at maximum frequency
and nominal voltage, which then scales with respect to different operating voltage
levels. In order to accommodate different power budgets prescribed by different
market segments, Navigo iterates through voltage and frequency settings until a specific
power target is met. When the model encounters a VddMIN constraint, it scales
only frequency to reduce power, at the expense of energy efficiency.
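The iteration over voltage and frequency can be sketched as a simple search. The code below is a minimal illustration, not Navigo's engine: the scaling laws (frequency proportional to Vdd, energy proportional to Vdd squared, leakage power proportional to Vdd) are textbook stand-ins for the HSPICE-derived factors, and all names and numeric values are hypothetical.

```python
def average_power(freq_hz, e_switch_j, n_switching, e_wire_j, p_leak_w):
    """Equation 2.3: active power plus leakage power, in watts."""
    return freq_hz * (e_switch_j * n_switching + e_wire_j) + p_leak_w


def fit_power_budget(budget_w, vdd_nom, vdd_min, freq_nom_hz, e_switch_nom_j,
                     n_switching, e_wire_nom_j, p_leak_nom_w, step_v=0.01):
    """Return (vdd, freq_hz, power_w) for an operating point within budget_w."""

    def operating_point(vdd):
        r = vdd / vdd_nom
        freq = freq_nom_hz * r                # stand-in: frequency ~ Vdd
        e_switch = e_switch_nom_j * r ** 2    # stand-in: energy ~ Vdd^2
        e_wire = e_wire_nom_j * r ** 2
        p_leak = p_leak_nom_w * r             # stand-in: leakage ~ Vdd
        return freq, e_switch, e_wire, p_leak

    # Phase 1: scale voltage and frequency together, down to the voltage floor.
    steps = int(round((vdd_nom - vdd_min) / step_v))
    for i in range(steps + 1):
        vdd = max(vdd_nom - i * step_v, vdd_min)
        freq, e_switch, e_wire, p_leak = operating_point(vdd)
        power = average_power(freq, e_switch, n_switching, e_wire, p_leak)
        if power <= budget_w:
            return vdd, freq, power

    # Phase 2: Vdd is pinned at vdd_min, so only frequency can drop further.
    # Leakage no longer shrinks, so energy per operation gets worse.
    _, e_switch, e_wire, p_leak = operating_point(vdd_min)
    if p_leak >= budget_w:
        raise ValueError("leakage alone exceeds the power budget")
    freq = (budget_w - p_leak) / (e_switch * n_switching + e_wire)
    return vdd_min, freq, average_power(freq, e_switch, n_switching, e_wire, p_leak)


if __name__ == "__main__":
    # Hypothetical 100 W design: 75 W active and 25 W leakage at 1.0 V, 3 GHz.
    design = dict(vdd_nom=1.0, vdd_min=0.7, freq_nom_hz=3.0e9,
                  e_switch_nom_j=4.5e-17, n_switching=5.0e8,
                  e_wire_nom_j=2.5e-9, p_leak_nom_w=25.0)
    for budget in (65.0, 35.0):
        vdd, freq, power = fit_power_budget(budget, **design)
        print(f"budget {budget:5.1f} W -> Vdd {vdd:.2f} V, "
              f"{freq / 1e9:.2f} GHz, {power:.1f} W")
```

In this hypothetical design, the 65 W budget is met by voltage and frequency scaling alone, while the 35 W budget pins the supply at the voltage floor and forces a further frequency-only reduction.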

While Navigo seeks to combine a variety of factors to accurately predict future
performance, it makes several optimistic assumptions. First, it may not be feasi-
ble to fit an integer number of cores into a predefined area. Hence, we allow for
half-size cores with IPC and power that scale linearly by one half. Although this sce-
nario is infeasible, for near-term technologies (e.g. 45nm), large area cores introduce
quantization effects that make it difficult to observe consistent trends. This effect
becomes significantly less important as we scale to more advanced technologies. Sec-
ond, future multi- and many-core systems will face a variety of challenges to enable
core-to-core communications. Navigo optimistically assumes a perfect on-chip inter-
connection network. Lastly, and perhaps most important, we assume workloads can
be fully parallelized to keep all cores running continuously. Hence, the model is
orthogonal to Hill's investigation that compares single-threaded versus multi-threaded
parallelism [27]. One of the main objectives of developing Navigo was to provide a
detailed and yet flexible model to help designers predict performance trends and guide
future designs. Moreover, we use Navigo to show that, even under optimistic assumptions
of perfectly parallel workloads running on highly parallel many-core designs, power
constraints will hamper performance growth and motivate designers to seek out new
solutions beyond simply increasing the number of cores on a die.
2.2 Power-constrained Performance for Multi-core
Navigo can be used to understand power-constrained performance scalability across
technology generations. In this section, we demonstrate the utility of Navigo by
exploring the scalability of three classes of CPU architectures when considering
power-constrained market segments (Table 2.4) and the impact of the minimum supply
voltage constraint.
For each of these explorations, we make several assumptions. First, we assume
that area and power will be fixed by the market segment. More advanced technology
nodes provide an increase in the number of available transistors leading to a doubling
of available cores per technology generation; however, frequency benefits will be con-
strained by power limits. If the power budget is exceeded for a given number of cores
and clock frequency, we scale voltage and frequency down to meet the power bud-
gets, subject to circuit constraints on the supply voltage, after which linear frequency
scaling is utilized.
2.2.1 Results without Power Constraints
To understand the impact of power constraints on scaling, we first consider the
scenario where power is not a design constraint. Figure 2.2 illustrates this scenario
with four sub-figures showing outputs of the model when scaled across technology
nodes for a fixed area budget of 310 mm². The four sub-figures quantify, across
the three core types, the number of cores, clock frequency, total power, and total
chip throughput. Without power constraints, all metrics scale up with technology.
Figure 2.2(a) shows that the number of Core2Duo cores starts at around 6 in the
45nm node (recall that core count is scaled to meet the 310 mm² budget), scaling
to 93 cores by 11nm. Without power limitations, frequency scaling continues
unabated, surpassing 19.12 GHz for the Xeon core in 11nm, but this comes at the price
of increased power dissipation, exceeding a kilowatt in the worst case.

Figure 2.2: Results without power constraints across process technologies. Results assume nominal voltage for the specified technology and the MPU-HP market segment with a die size of 310 mm². Panels, each plotted against technology node (45 nm to 11 nm) for the Atom, Xeon, and Core 2 cores: (a) number of cores, (b) frequency (GHz), (c) total power (W), (d) throughput (log scale), with ideal 1.58x/year and 1.35x/year (Core2) reference curves.

Figure 2.2(d) plots total chip throughput relative to the Core2Duo from the 45nm
technology node, as calculated by increasing the core count along with frequency
improvement. The throughput improvement increases at a slightly lower rate than the
historical growth rate of 1.58x per year. This shows that if power were not a
constraint, performance growth could be sustained through a combination of
traditional frequency scaling and multi-core design.
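The core counts quoted for Figure 2.2(a) are consistent with the doubling-per-generation assumption stated at the start of Section 2.2. The snippet below is a minimal sketch of that arithmetic, assuming exact doubling at every node; the function name is illustrative and not part of Navigo.

```python
NODES_NM = [45, 32, 22, 16, 11]  # technology nodes covered in this section


def cores_per_node(cores_at_first_node):
    """Double the core count at every node after the first (assumed exact)."""
    return {node: cores_at_first_node * 2 ** i for i, node in enumerate(NODES_NM)}


# Starting from exactly 6 cores at 45nm.
print(cores_per_node(6))

# Working backwards from the reported 93 cores at 11nm (four doublings).
print(93 / 2 ** 4)
```

Exact doubling from 6 cores overshoots the reported 93 slightly, while working backwards from 93 gives a starting count a little under 6, which matches the "around 6" quoted in the text.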
2.2.2 Results with Power Constraints
Incorporating power constraints into our analysis gives a more realistic picture of
expected trends in future technologies. We show that market segments that tolerate
higher power density scale better than more constrained market segments. In this
section, we compare the server market segment, which uses the same 310 mm² die with
a power limit of 198W, and the mobile market segment, which uses a 100 mm² die with
a power limit of 35W. Figure 2.3 and Figure 2.4 plot the scalability analysis for
the server and mobile market segments, respectively, across the three core types.
Each figure shows the required supply voltage, clock frequency, total power, and
total chip throughput.
Figure 2.3: Results with power constraints across process technologies - Server. Results assume nominal voltage for the specified technology and the MPU-HP market segment with a die size of 310 mm² and max power of 198 W. Panels, each plotted against technology node (45 nm to 11 nm) for the Atom, Xeon, and Core 2 cores: (a) Vdd (V), (b) frequency (GHz), (c) total power (W), (d) throughput (log scale).

Figure 2.4: Results with power constraints across process technologies - Mobile. Results assume nominal voltage for the specified technology and the Mobile market segment with a die size of 100 mm² and max power of 35 W. Vdd is limited to VddMIN. Panels, each plotted against technology node (45 nm to 11 nm) for the Atom, Xeon, and Core 2 cores: (a) Vdd (V), (b) frequency (GHz), (c) total power (W), (d) throughput (log scale).

Focusing on the results for the server market segment, we observe several important
trends. For the Intel Xeon design, power is already constrained at the 45nm
technology node, and the design must reduce supply voltage from nominal in order
to meet the power goal. When moving to the 32nm node, the Xeon is able to achieve
a small frequency increase by operating at the minimum supply voltage. Beyond
32nm, the Xeon frequency reduces slightly and then flattens out as the power budget
is soaked up by additional cores. In contrast, the Intel Core2Duo design allows full
frequency scaling until the 22nm technology node, after which scaling is curtailed;
in 11nm, frequency must be throttled when adding more cores. The Intel Atom
core is much more power-efficient and can continue to scale frequency until 11nm,
with additional power headroom. However, Atom starts with a significant performance
disadvantage compared to Core2Duo, and hence by 11nm, the Core2Duo and
Atom roughly converge on total throughput. In 11nm, the best designs (Atom and
Core2Duo) track a growth rate of 1.35x per year, which by 11nm leaves their
throughput nearly 6.6x below the 1.58x per year curve.
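The size of this gap can be reproduced from the two growth rates alone. The sketch below assumes that the five nodes from 45nm to 11nm span 12 years (three years per node); that spacing is our assumption for illustration and is not stated in this section.

```python
def growth_gap(ideal_rate, actual_rate, years):
    """Factor by which compounded ideal growth exceeds actual growth."""
    return (ideal_rate / actual_rate) ** years


# Assumed: 4 node transitions (45nm -> 11nm) at 3 years each.
YEARS_45NM_TO_11NM = 4 * 3

gap = growth_gap(1.58, 1.35, YEARS_45NM_TO_11NM)
print(f"throughput gap after {YEARS_45NM_TO_11NM} years: {gap:.2f}x")
```

With a 12-year span, the compounded difference between 1.58x and 1.35x per year comes out close to the 6.6x gap quoted above.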
The mobile market segment, seen in Figure 2.4, exhibits similar trends, but the
tighter power constraints result in more severe reductions in clock frequency and
slower overall per-year throughput growth. For example, the Core2Duo hits a
frequency cap around 32nm, and frequency flatlines until 16nm, when it dips slightly.
Even the Atom processor hits its power cap at 16nm, after which its frequency also
dips to maintain the power budget.
An important issue that we see repeatedly throughout the above scenarios is that the
minimum Vdd constraint is reached as we seek to fit designs with many cores into fixed
power budgets by reducing voltage and clock frequency. When a design reaches
this constraint, additional power reduction can only be achieved through inefficient
frequency scaling: an essentially linear reduction in clock frequency offsets the
additional cores. Practically speaking, designers may prefer to simply stop scaling
the number of cores in a system at this point. In order to understand this effect,
we have run additional simulations with the constraint removed; the results are
shown in Figure 2.5.

Figure 2.5: Results with power constraints across process technologies without VddMIN constraints - Mobile. Results assume nominal voltage for the specified technology and the Mobile market segment with a die size of 100 mm² and max power of 35 W. Vdd can be reduced without a lower limit. Panels, each plotted against technology node (45 nm to 11 nm) for the Atom, Xeon, and Core 2 cores: (a) Vdd (V), (b) frequency (GHz), (c) total power (W), (d) throughput (log scale).

With the constraint removed, we significantly reduce Vdd to meet the power
constraints set by the market, to as low as 0.6V for the Xeon and Core2Duo
microarchitectures in advanced technologies. There is a clear loss in throughput for
systems under minimum Vdd constraints, Figure 2.4(d), compared to systems without
minimum Vdd constraints, Figure 2.5(d). For the Atom processor, minimum Vdd is not
a severe issue. For the mobile market segment in the 11nm node, scaling Vdd reduces
throughput by 13.4%. However, the minimum voltage constraint reduces the throughput
of the Xeon core by 57.6% for the same target. Even without this constraint, the
Xeon still performs poorly compared to the more power-efficient cores, because
running at very low voltage does not provide ideal performance.
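A rough way to see why frequency-only scaling is inefficient is to compare how much clock frequency survives a given power reduction in each regime. The sketch below uses a first-order assumption that is ours rather than Navigo's: power is proportional to Vdd^2 * f and frequency is proportional to Vdd. The source notes that very low voltages behave worse than this idealization, so the joint-scaling numbers are optimistic. The function names are illustrative.

```python
def freq_kept_joint_scaling(power_fraction):
    """Fraction of frequency kept when Vdd and f scale together (P ~ s^3)."""
    return power_fraction ** (1.0 / 3.0)


def freq_kept_frequency_only(power_fraction):
    """Fraction of frequency kept when Vdd is pinned at its floor (P ~ f)."""
    return power_fraction


# Cutting power to half, for example to absorb a doubling of core count.
target = 0.5
print(f"joint V/f scaling keeps {freq_kept_joint_scaling(target):.3f} of frequency")
print(f"frequency-only keeps    {freq_kept_frequency_only(target):.3f} of frequency")
```

Under these assumptions, once Vdd is pinned, each doubling of cores at constant power costs half the clock frequency, so total throughput stops improving; with joint scaling most of the frequency is retained.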
2.3 Validating the Model
This section presents a back-validation of Navigo for microprocessors built from
1996 to 2007. Because of the predictive nature of the model, it is difficult to validate
Navigo's predictions of the power and performance of microprocessors built using
future process technologies. Therefore, we validate Navigo, seeded with an initial
data point from 1996, against microprocessors manufactured over the last 10 years.
For validation, we seeded the microarchitecture library with the DEC Alpha 21164
microprocessor, introduced in 1996 and manufactured in 350nm technology. We
developed the technology and circuits library based on ITRS data from 1997 to 2007
and circuit simulation results, using SPICE models from industry and PTM. In 1997,
the ITRS committee did not anticipate the growth in power density that started with
the 180nm technology node. Therefore, for each node, we chose the technology model
from the ITRS year closest to the date of introduction. This technique isolates the
error in ITRS predictions from the modeling framework.

CPU             Year  Node (nm)  Die Size (mm²)  Throughput  Freq (GHz)  Power (W)
Alpha 21164     1996  350        210             481         0.5         31
Alpha 21164     1997  350        141             649         0.6         40
Alpha 21264     1998  350        314             993         0.6         73
Alpha 21264A    1999  250        210             1267        0.7         85
Pentium III     2000  180        106             1779        1.0         29
Athlon          2001  180        130             2584        1.6         68
Pentium 4       2002  130        146             4195        3.0         81.8
Opteron         2003  130        193             5364        2.2         89
Xeon            2004  130        237             5764        3.6         92
64-bit Xeon     2005  90         81              6505        3.6         110
Core 2 Extreme  2006  65         143             17909       2.93        75
Xeon 3085       2007  65         143             23207       3           65
POWER6          2007  65         341             35071       4.7         180

Table 2.5: Select Microprocessors from 1996 to 2007. Performance data is from the analysis in Figure 1.1. Power consumption and die size data were acquired from datasheets and published microprocessor reports.
We compare predictions from Navigo with microprocessors manufactured between
1996 and 2007, shown in Table 2.5. We calculate throughput from the same Hennessy
and Patterson and SPECint2006 benchmark data used to develop Figure 1.1,
described in Section 1.1.1. We gathered power consumption data from datasheets
and online microprocessor reports. The die sizes of the microprocessors vary widely;
therefore, we compare throughput per unit area and power per unit area.
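As an illustration of this normalization, the sketch below computes both per-area metrics for three rows copied from Table 2.5. It treats the die-size column as an area in mm², which is our reading of the table header, and expresses power density in mW/mm² to match the axis of Figure 2.6(b). The function name is illustrative.

```python
# Rows copied from Table 2.5: (CPU, year, die size, throughput, power in W).
# The die-size column is treated as an area in mm^2 (assumed unit).
TABLE_2_5_SAMPLE = [
    ("Alpha 21164", 1996, 210, 481, 31),
    ("Pentium 4", 2002, 146, 4195, 81.8),
    ("Core 2 Extreme", 2006, 143, 17909, 75),
]


def per_unit_area(die_size_mm2, throughput, power_w):
    """Return (throughput per mm^2, power density in mW/mm^2)."""
    return throughput / die_size_mm2, 1000.0 * power_w / die_size_mm2


for cpu, year, die, thr, pwr in TABLE_2_5_SAMPLE:
    thr_density, pwr_density = per_unit_area(die, thr, pwr)
    print(f"{cpu} ({year}): {thr_density:.2f} throughput/mm^2, "
          f"{pwr_density:.0f} mW/mm^2")
```

Normalizing by area lets a small die such as the Pentium 4 be compared directly with a much larger one such as the Alpha 21164.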
Figure 2.6(a) presents a comparison of throughput per unit area predicted with
Navigo and the throughput of commercially available microprocessors. The x-axis
represents both technology node and year of introduction.

Figure 2.6: Validation of Navigo using Microprocessors from 1996 to 2007. Predicted results use the most recent ITRS technology models. The initial core model is an Alpha 21164 0.5 GHz in 350nm technology introduced in 1996. The data points representing commercially available systems are also presented in Table 2.5. Panels, each comparing Navigo with commercial microprocessors across technology node and year (350nm 1997, 250nm 1999, 180nm 2000, 130nm 2002, 90nm 2005, 65nm 2006): (a) throughput per unit area (log scale), (b) power per unit area (mW/mm²).

The throughput of the initial core, the 0.5 GHz Alpha 21164, matches the prediction
from Navigo, which reveals the absence of static offset errors in the model. The
throughput predicted by Navigo aligns well with the results from the benchmarked
microprocessors. Generally, Navigo estimates the upper bound of throughput per unit
area. To combat increasing power consumption, designers of microprocessors in the
65nm node slowed the scaling of clock frequency and chose to design multi-core
processors made of simpler cores. Navigo overestimates the throughput of multi-core
designs because it assumes that the costs of communication and thread
synchronization are zero.
While Navigo predicts a general trend of increased power density, shown in
Figure 2.6(b), it does not predict the drastic jump in power consumption caused by
changes in microarchitecture, as it assumes a fixed core design. During the period
between 1997 and 2005, microarchitects aggressively pursued single-thread performance,
resulting in several high-throughput, high-power designs. The deeply
pipelined Netburst microarchitecture, manufactured in 130nm (Pentium 4 and Xeon),
had notoriously high power consumption. Subsequently, the industry changed course
and introduced more power-efficient multi-core designs. The power consumption
predicted by Navigo matches the initial core, the Alpha 21164 in 350nm. Navigo also
aligns well with the multi-core designs in the 65nm node, which utilize cores that
have microarchitectures similar to the Alpha. The model correctly shows the
transition between an earlier era, when constant field scaling was still possible
and power density remained constant (350nm, 250nm, 180nm), and the current era of
increasing power density.
Our back-validation shows that Navigo predicts throughput well and captures the
general trends in power consumption. Navigo incorporates a static model of
microarchitecture; thus, for a more accurate prediction of power consumption, users
should include cores in their libraries that best represent their target core design.
2.4 Modeling Specialization
Consistent progress towards smaller, faster, and more numerous transistors with
each generation of process technology no longer yields the steady growth in computing
performance enjoyed throughout the 20th century. The power ceiling forced a
right-hand turn in single-thread performance, and CPU designers have been racing
to implement multi-core systems ever since. Unfortunately, Navigo predicts that
even for the server market segment, multi-core scaling will only yield a 1.35x/year
performance growth trend. In order to get back onto the 1.58x/year growth trend,
designers must maximize the efficiency of transistor (and wire) switching. In other
words, designers must minimize the overheads associated with a general-purpose (GP)
CPU. One obvious direction is to replace general-purpose computing with dedicated,
specialized hardware that offers higher computation per unit area and power, for an
increasing fraction of the machine's workload. IBM's CELL processor is one such
example. It includes 8 SPEs, which are specialized cores used to speed up SIMD
workloads [11]. Similarly, graphics processing units (GPUs) have been used extensively
by programmers to speed up tasks related to video processing and other SIMD
operations. Another example would be to introduce dedicated hardware specialized for
H.264 decoding. In order to understand the potential benefits of specialization, this
section introduces a parallel variant of Amdahl's Law for specialization. Then, by
augmenting Navigo with specialization, we project the amount of specialization that
will be required in future computing systems to increase system throughput by 1.58x
per year.
2.4.1 Variant of Amdahl's Law for Specialization
Amdahl's Law is commonly used to describe the theoretical limitations of application
speedup given constraints on the fraction of the workload that can be sped
up.
\[
\mathrm{Speedup}_{\mathrm{enhanced}}(f, S) = \frac{1}{(1 - f) + \dfrac{f}{S}} \tag{2.4}
\]
where f is the fraction of the workload that can be enhanced and S is the amount of
speedup possible through enhancements. Amdahl's Law has been adapted to model
symmetric and asymmetric multi-core systems [27], where parallel cores can execute
all workloads.

Figure 2.7: Speeding up an application with specialized cores. A workload is split to an additional set of resources (the specialized core). The fraction of the application that can be executed on the specialized core is f, with a speedup of S. Panels: (a) calculation framework, (b) normalized throughput versus fraction of workload (f) and speedup (S).

With specialized cores, we must make a few assumptions in order to
model speedup using Amdahl's Law. First, we assume special-purpose (SP) cores
can only run specific parts of an application (f) while general-purpose cores can run
the entire workload, albeit with lower efficiency. Second, we optimistically assume
that workloads are arbitrarily parallelizable (also previously assumed in Navigo). T
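Equation 2.4 is straightforward to evaluate numerically. The following is a minimal sketch in Python of the classic form of Amdahl's Law only, not of the specialization variant developed in this section; the function name and the input checks are illustrative choices, not part of Navigo.

```python
def amdahl_speedup(f, s):
    """Overall speedup from Equation 2.4.

    f: fraction of the workload that can be enhanced (0 <= f <= 1)
    s: speedup of the enhanced fraction (s > 0)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


# Even a very large speedup helps little when only half the workload benefits.
for f in (0.5, 0.9, 0.99):
    print(f"f={f}: S=10 -> {amdahl_speedup(f, 10):.2f}x, "
          f"S=1000 -> {amdahl_speedup(f, 1000):.2f}x")
```

The overall speedup is bounded by 1 / (1 - f) however large S becomes, which is why the fraction of the workload that can run on a specialized core matters as much as the speedup that core provides.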
