-
Accelerator-Based Architectures for Wireless Sensor Network
Applications
A dissertation presented
by
Mark David Hempstead
to
School of Engineering and Applied Science
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the subject of
Engineering Sciences
Harvard University
Cambridge, Massachusetts
May 2009
-
© 2009 - Mark David Hempstead
All rights reserved.
-
Thesis advisors: David Brooks and Gu-Yeon Wei
Author: Mark David Hempstead
Accelerator-Based Architectures for Wireless Sensor Network
Applications
Abstract
Growing power consumption threatens the explosive growth that
the semiconductor
industry has sustained over the last several decades. While the
number of transistors
continues to double every process technology generation, the
slowing of constant field
scaling has caused power density to increase, limiting clock
frequency. To combat these
trends, designers must get more performance from each transistor
switch. Technology
companies are applying microprocessors to a growing diversity of
applications that
are increasingly mobile and untethered from the power grid. One
such domain is
the emerging area of wireless sensor networks (WSNs) where,
because nodes are
often deeply embedded in an environment, power consumption is
the primary design
constraint.
This dissertation explores the challenges of designing in a power-constrained era
through the development of a model we call Navigo and the design and implementation
of an accelerator-based architecture for WSNs. We designed Navigo to aid in
early architecture exploration as an alternative to the spreadsheets and
back-of-the-envelope calculations that planners use to guide future designs.
The results show that, even under ideal conditions, multicore processors will not
achieve the performance gains necessary to maintain growth. This dissertation
shows that if an increasing amount of area per technology node is allocated to
specialized accelerators, then microprocessor performance growth will be maintained.
As a case study of accelerator-based architectures, we developed a processor for
WSNs. Our architecture includes accelerators for regular tasks, and event handling is
offloaded to the event processor, removing the software overhead of a general-purpose
design. Because the architecture is modular, VDD-gating can be employed to address
leakage current at the architecture level. We built a prototype in 130nm CMOS. We
compare our system to other systems in the literature and to a general-purpose
design. Our system has the lowest energy per equivalent instruction, and the results of
our workload analysis show that the system is suited to both low-intensity and
high-performance WSN applications.
-
Contents
Title Page
Abstract
Table of Contents
List of Figures
List of Tables
Citations to Previously Published Work
Acknowledgments
Dedication

1 Introduction and Summary
1.1 Motivation
1.1.1 Technology and Trends
1.1.2 Market Requirements
1.2 Holistic Approach
1.3 Accelerator-based Architectures
1.4 Summary of Contributions

2 Navigo: A Model to Study Power-Constrained Architectures and Specialization
2.1 Navigo: A Model for Performance Trends in Future Technologies
2.1.1 Modeling Methodology and Sample Libraries
2.2 Power-constrained Performance for Multi-core
2.2.1 Results without Power Constraints
2.2.2 Results with Power Constraints
2.3 Validating the Model
2.4 Modeling Specialization
2.4.1 Variant of Amdahl's Law for Specialization
2.4.2 Examples of Specialized Cores
2.5 Model Limitations and Future Directions

3 An Ultra Low Power Event Driven Architecture for WSNs
3.1 Background and Motivation
3.1.1 Overview of WSN Applications
3.1.2 PowerTOSSIM Modeling Commercially Available Systems for WSN
3.1.3 Low-Power Circuit Design Techniques
3.1.4 Energy Scavenging
3.2 Goals of the Architecture
3.3 Architecture Description
3.3.1 System Bus Description
3.3.2 Event Processor Specification
3.3.3 Description of Accelerators and Other Blocks
3.4 Architecture Evaluation
3.4.1 Performance Modeling - SystemC Simulator
3.4.2 Test Application
3.4.3 Cycle Performance Estimates
3.5 Selection of Process Technology
3.5.1 Background on Technology Scaling
3.5.2 Simulation Study
3.5.3 Modeling Architecture Across Process Technologies
3.5.4 Results of System Analysis

4 Silicon Implementation and Evaluation of Accelerator Based Systems
4.1 Implementation Details
4.1.1 Design Flow and Tools Used
4.1.2 VDD-gate circuit
4.1.3 Die-Photo and Test Chip Specifications
4.2 Measurements of Prototype
4.2.1 Test Methodology and Setup
4.2.2 Functional Verification
4.2.3 Block Level Power Measurements
4.2.4 Energy per Task and Energy per Instruction
4.3 Comparison to Related Work
4.3.1 Categorization and Description of Similar Systems
4.3.2 Summary and Comparison
4.4 Comparison to General Purpose Microcontroller
4.4.1 Performance and Energy Benefits of Specialization
4.4.2 Workload Analysis and DVFS
4.5 Using Navigo to Guide Future Revisions

5 Conclusion and Future Directions
5.1 Summary of Themes and Results
5.2 Future Work
5.2.1 Improved Modeling Frameworks
5.2.2 Memory Systems for Accelerator-Based platforms
5.2.3 Applying Accelerator-Based Architectures to Desktop/Mobile platforms

Bibliography

A Related Work: Description of Similar Systems
A.1 General Purpose Commodity Based Systems
A.2 Smart Dust - Early Event Driven
A.3 Subthreshold Systems
A.4 Asynchronous - SNAP
A.5 Charm - Network Stack Acceleration

B Detailed Design Documents
B.1 System Bus Signals
B.2 Memory Map
B.3 Interrupt Map
B.4 Power Domains
-
List of Figures
1.1 Growth in Microprocessor Performance. Historically the industry has observed a total 1.58x performance gain per year. Power consumption constraints inhibit performance growth, causing a gap between expected and delivered performance. Data from Hennessy and Patterson [25] and spec.org [54].
1.2 Research Approach. We take a holistic approach to research, understanding and addressing power consumption at all layers of the design space. Architecture innovations are informed by modeling and prototyping.
2.1 Graphical depiction of Navigo. The model accepts library files for process technology, circuits, architecture, and market segments, and computes total and constrained power for a set of user-defined inputs such as supply voltage, frequency, etc.
2.2 Results without power constraints across process technologies. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm2.
2.3 Results with power constraints across process technologies - Server. Results assume nominal voltage for specified technology and MPU-HP market segment with a die size of 310 mm2 and max power of 198 W.
2.4 Results with power constraints across process technologies - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm2 and max power of 35 W. Vdd is limited to VddMIN.
2.5 Results with power constraints across process technologies without VddMIN constraints - Mobile. Results assume nominal voltage for specified technology and Mobile market segment with a die size of 100 mm2 and max power of 35 W. Vdd can be reduced without a lower limit.
2.6 Validation of Navigo using Microprocessors from 1996 to 2007. Predicted results use the most recent ITRS technology models. The initial core model is an Alpha 21164 0.5 GHz in 250nm technology introduced in 1996. The data points representing commercially available systems are also presented in Figure 2.5.
2.7 Speeding up an application with specialized cores. A workload is split to an additional set of resources, the specialized core. The fraction of the application that can be executed on the specialized core is f, with a speedup of S.
2.8 Understanding the impact of specialization on throughput. Calculations of throughput with specialization for different speedups (S) and fractions of workload (f). Assumes the general purpose core is fully utilized and resources for an additional specialized core have been provisioned.
2.9 Specialization across process technologies with real SP cores. Total throughput for different values of f assuming the area and speedup of one example SP core per GP core. Mobile 35W market segment.
2.10 Configurations that can achieve 1.58x/year throughput. Models two different accelerator structures: the programmable CELL SPE and an H.264 accelerator. Core2Duo-based GP cores and the Mobile 35W market assumed.
3.1 Measured and simulated current consumption for the Beacon application. The simulated version includes a breakdown according to radio, LEDs, and CPU current. A lower resolution digital multi-meter was used for the above measurement, which did not capture the very short duration peak power spikes during the wakeups.
3.2 Surge Application Power Consumption Breakdown. 60 sec of the surge TinyOS application run on the Mica2 mote.
3.3 System Block Diagram.
3.4 Event Processor State Machine
3.5 Diagram and Code of the Monitoring Application. The code displayed are ISR routines written for the event processor. Actual address values have been omitted to make the code easy to read.
3.6 Test Circuit Used for Simulations. The circuit consists of an 11 stage ring oscillator made up of an assortment of logic gates. Interconnect was modeled between devices.
3.7 Leakage Power, EDP, and Frequency Across all Technologies. Each line indicates a technology node from 180nm to 70nm. Supply voltage is on the X-axis, which was swept from 0.1V to the max VDD specific to the process. Temperature is 20C and all transistors are minimum size.
3.8 Results for Baseline Architecture. Performance target of N=100 sense and transmit tasks.
3.9 Effect of Energy Reduction Techniques on Total Energy Consumption of the Architecture Across Process Technologies. Power Supply voltage is limited to VtP + VtN and the number of tasks per second is 100.
3.10 Summary of Energy Reduction Techniques Across Process Technologies. Each bar represents the minimum energy calculated for a particular architecture configuration and process technology. Both the total energy consumption and a percentage breakdown of the source of energy consumption are included.
4.1 Custom VDD-Gating Circuit. The schematic shows four different parallel legs which are used to control VDD-gating strength. Layout of the filter component shows where the VDD-gating circuit is attached. In this example, the VDD-gating circuit requires an additional area of 3.2%.
4.2 Die Photograph of 130nm Prototype. System includes an event processor and several accelerators for regular operation. The system has been realized in 130nm CMOS on a 2mm x 2mm die. The system contains 444,982 transistors including 4KB of foundry supplied SRAM.
4.3 Frequency versus Voltage Shmoo. Shaded region of plot indicates where the test failed; the unshaded region indicates successful operation. Results from a full run of the sense and transmit application were used to generate a shmoo. Due to limitations of the test board the chip was measured up to 12.5 MHz. The shmoo generated using post layout simulations indicates the chip will work up to 100 MHz.
4.4 Measured power consumption of the prototype under different supply voltages and clock frequencies. Plots a-c show the power consumption for the Event Processor, Accelerator, and SRAM power domains while sweeping voltage from 450 mV to 800 mV and frequency from 25 kHz to 12.5 MHz. Idle power is measured with the external clock off (0 MHz @ 550 mV). The VDD-gating transistor is off (not-conducting) during the measurement of gated power.
4.5 Energy per Task of Sense and Transmit Task. Application includes all accelerator blocks and power contributions from the SRAM and Event Processor.
4.6 Comparison to Other Systems Designed for WSN.
4.7 Performance and Power Benefits of Specialization. Test routines were executed both on the hardware accelerators and the microcontroller. Cycle count and energy savings are presented.
4.8 Evaluation of Accelerator-based Architecture vs. General Purpose System
4.9 WSN architecture projected to advanced process technologies and power budgets. Die size, f, S are fixed based on measurements of the original system. Area is swept and the configuration with the maximum throughput is reported for three different power budgets.
A.1 Smart Dust Microarchitecture [59].
A.2 Block Diagram of the Subliminal Processor (University of Michigan) [51].
A.3 Simplified block diagram of the SNAP processor for WSN. System includes separate instruction and data memories, a timer coprocessor, and a message processor which provides a FIFO interface to the off-chip radio and sensors [9].
A.4 The Charm protocol processor microarchitecture [52].
-
List of Tables
2.1 Predicted Process Technology Characteristics. High-Performance Microprocessor Technology, ITRS 2007 Edition [50].
2.2 Technology Scaling Factors. High-Performance Microprocessor Logic. Indicates a departure from historical scaling trends resulting in an increase in power density. [50]
2.3 Example Cores used in analysis. Data collected from conference and journal publications and datasheets. SPEC2006 results used to determine IPC are from spec.org.
2.4 Market Segment Constraints. Die size and Max Power Consumption for a set of market segments. Values for the first three markets came from ITRS [50]. The final four market segments are based on die size and thermal design point of commercially available Intel Processors.
2.5 Select Microprocessors from 1996 to 2007. Performance data is from the analysis in Figure 1.1. Power consumption and die size data was acquired from datasheets and published microprocessor reports.
2.6 Specialized Cores. Example SP cores used in the model. All measurements were scaled to 65nm technology and speedup was calculated by comparing published performance results to the performance on a general purpose CPU. The Core2 is included to show the relative area and performance cost of including another GP core instead of an SP core. Power and speedup for CELL SPE running Linpack.
3.1 Sensor Sampling Rates of Different Phenomena
3.2 Example WSN application domains.
3.3 Power model for the Mica2. The mote was measured with the micasb sensor board and a 3V power supply.
3.4 Event Processor Instruction Set
3.5 Comparison of cycle count for the test application written on our architecture and on TinyOS for the Mica Platform.
3.6 Scaling Factors From theory and simulation data
3.7 Activity Ratios for Our Test Application
B.1 System Bus Signals
B.2 System Memory Map. All addresses are in hex.
B.3 System Interrupt Map. Lists all of the interrupts in the prototype and the source of the interrupt.
B.4 Power Domains in the Prototype. Lists all of the power domains in the prototype including virtual power domains and power domains for testing only.
-
Citations to Previously Published Work
The architecture presented in Chapter 3 first appeared in the following paper:
    An ultra low power system architecture for sensor network applications, Mark Hempstead, Nikhil Tripathi, Patrick Mauro, Gu-Yeon Wei, and David Brooks, In The 32nd Annual International Symposium on Computer Architecture (ISCA), June 2005.
The PowerTOSSIM simulator, presented in Section 3.1.2 including Figure 3.1, appeared in:
    Simulating the Power Consumption of Large Scale Sensor Network Applications, Victor Shnayder, Mark Hempstead, Bor-Rong Chen, Geoff Werner Allen, and Matt Welsh, In Proceedings of the Second ACM Conference on Embedded Networked Sensor Systems (SenSys), Baltimore, MD, Nov 2004.
The evaluation of process technology selection, presented in Section 3.5, appeared in:
    Architecture and Circuit Techniques for Low-Throughput, Energy-Constrained Systems Across Technology Generations, Mark Hempstead, Gu-Yeon Wei and David Brooks, In Proceedings of the International Conference On Compilers, Architecture, And Synthesis For Embedded Systems (CASES), Seoul, South Korea, October 2006.
The related work, presented in Section 4.3 and Appendix A, was first surveyed in the following invited paper:
    Survey of hardware systems for wireless sensor networks, Mark Hempstead, Michael J. Lyons, David Brooks and Gu-Yeon Wei. ASP Journal of Low Power Electronics, Vol. 4, No. 1, April 2008.
The Navigo model presented in Chapter 2 is currently under submission in the following paper:
    Navigo: A Model to Study Power-Constrained Architectures and Specialization, Mark Hempstead, Gu-Yeon Wei, and David Brooks [Under Submission]
The measurement results of our prototype, presented in Chapter 4, are currently under submission:
    An accelerator-based wireless sensor network processor in 130nm CMOS, Mark Hempstead, David Brooks, and Gu-Yeon Wei, [In preparation]
-
Acknowledgments
The path to this PhD has been an adventure, and I would like to
take this opportunity to thank all of those who have helped and supported me
along the way.
Throughout my journey the path was often hard to find and,
without the guidance and
encouragement from these individuals, I would have never
overcome the academic,
technical, and emotional challenges that blocked my way.
First, I would like to thank my advisers Gu-Yeon Wei and David
Brooks for taking
a chance on me to start a fruitful collaboration across the
disciplines of circuit design
and architecture. Throughout the last few years they have
supported and guided my
transformation as a researcher. I appreciate the endless hours
they spent providing
feedback on talks, papers, and chips, pushing me to think more
deeply. Early in
my research career I received valuable feedback from my
qualification committee,
Woodward Yang and Paul Horowitz. I am grateful to Margo Seltzer
for her instruction
in paper writing and presentations in CS261 and, more recently,
for agreeing to serve
on my dissertation committee.
Throughout the duration of my research project, several
individuals helped me
with architecture exploration and early Verilog coding,
including: Nikhil Tripathi,
Patrick Mauro, and Xiaoyao Liang. Michael Lyons and I have
enjoyed a strong collaboration
brainstorming the design of SMASH, a next-generation
architecture. I wish
to thank the other members of the Mixed-signal VLSI and
Architecture groups: Amber Tan,
Ruwan Ratnayake, Andrew Liu, Hayun Chung, Ankur
Agrawal, Wonyoung
Kim, Durlov Khan, Meta Gupta, Benjamin Lee, VJ Reddi, and Kevin
Brownell.
They provided invaluable instruction and support when I was met
with problems
using CAD tools, test equipment, and architecture simulators.
Moreover, they were
the source of supportive conversations at lunch, over dinner and
during late night
tape-outs.
Halfway through my grad student career, our group received the
gift of Glenn
Holloway, whose management of our machines and debugging support
at all hours
saved me weeks of frustration. Jim MacArthur in the Cruft
circuits lab was an
invaluable resource when I needed help designing PCBs,
soldering, or finding random
parts. Because my research crossed into the systems realm, early
collaborations with
the wireless sensor network (WSN) group, including Matt Welsh,
Geoffrey Werner
Challen, Victor Shnayder, and Bor-Rong Chen, helped me understand
the needs
of the WSN community. I'm thankful to UMC and the SRC for
supporting the
fabrication of my two test chips. I would like to thank Joel
Emer, Mark Charney, and
Geoffrey Loweny for hosting me at Intel in Hudson, MA for a
summer and exposing
me to research in higher performance systems.
For me, grad school was more than just research: I had the
opportunity to engage
in a diverse set of activities from teaching to graduate
student organization
and the Harvard house system. Harry Lewis introduced me to his
unique course,
QR48:BITS, and he was a wonderful teaching mentor who gave me
the chance to try
my hand at lecturing. Likewise, Woodward Yang showed me how to
coach students
in engineering design in ES96. I'm thankful to Hwa Chang and
Jeffery Hopwood at
Tufts for mentoring me after I took over the digital logic class
this semester. I would
like to encourage the students who have taken over the graduate
student life committee
to continue the good work of building a community within
SEAS and motivating
graduate students to leave their labs occasionally. For the past
three years, my fellow
tutors, masters, and students have made Lowell House into a
vibrant and supportive
home.
Throughout my graduate school experience, it was the support of
my caring friends
and family that kept me going. Specifically, I would like to
thank my parents, David
and Rolande, who brought me up with such caring and supported me
with a smile
when I turned down a job in the real world for graduate school.
My father, who
taught me to think like an engineer at a young age through his
probing questions at
the dinner table, continues to challenge me today. My mother,
who rightly believes I
need emotional support just as much as technical support,
continues to pick me back
up after each paper rejection. My sister Amy was my lifeline
here in Boston over the
past few years. Though she has suppressed her engineering genes,
she continues to
surprise me with a display of her scientific mind over a bottle
of wine. My brother
Chris, the more practical engineer, taught me how to put a
square peg in a round
hole with a big hammer. His thoughtfulness and ingenuity just
might convince
me to start a company with him ... someday. Finally, I cannot
give enough thanks
to Megan, whose caring, kindness, and support over the last few
years made this
dissertation possible and easier to read. I look forward to many
more adventures
together and one more dissertation between us.
-
Dedicated to those who have paved the way for me
my parents David and Rolande,
and my grandparents David and Margaret Hempstead, and Rudy and Lillian Perreault.
-
Chapter 1
Introduction and Summary
Contents
1.1 Motivation
1.1.1 Technology and Trends
1.1.2 Market Requirements
1.2 Holistic Approach
1.3 Accelerator-based Architectures
1.4 Summary of Contributions
Advances in computational capabilities have driven the
information technology
revolution, which in turn has driven advances in nearly all
fields of science, medicine,
and business. Although incredibly powerful computing devices are
available today,
this single-minded pursuit of performance has made power
consumption one of the
main bottlenecks for nearly all types of computing systems, from
high-end servers
to wireless sensor devices. Due to limitations in device cooling
at the high end and
battery technology at the low end, processor designs are
increasingly stratified into
power-constrained market segments in which the challenge is to
increase processor
performance for a fixed power budget. While advanced fabrication
technologies are
projected to continue to provide computer designers a doubling
of transistors per
generation, slowing constant-field scaling and worsening wire
parasitics mean that the
energy per switching event will scale at a rate at which chip power
essentially remains
constant for a fixed clock frequency and core activity.
Current trends towards
large multi-core systems utilize the additional transistor
bounty for additional power-efficient
cores but, with single-thread performance saturated,
most benefits will come
through thread-level parallelism. Assuming an optimistic
scenario for the continued
extraction of thread-level parallelism from workloads, chip
performance gains will
track growth in transistor counts. The International Technology
Roadmap for Semiconductors
(ITRS) projects a doubling in the number of
transistors every three years
(i.e., roughly 1.25x per year), leading to an increasing gap between
projected performance
growth and historical performance growth rates. Bridging this
performance gap will
require an architectural paradigm shift to augment the
multi-core trend, in which
an increasing fraction of chip real estate must be devoted to
specialized logic that
provides significant benefits in performance per switching event
for a growing portion
of workloads.
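
The size of this gap can be estimated directly from the two growth rates quoted above. The following is a minimal sketch, not part of the Navigo model: the function names and the chosen time horizons are assumptions made for illustration, and it assumes that performance tracks transistor count exactly. Note that a doubling every three years works out to about 1.26x per year, which the text rounds to 1.25x.

```python
# Illustrative sketch only: the names and horizons below are assumptions
# for this example and do not come from the Navigo model.

HISTORICAL_RATE = 1.58      # historical performance gain per year (from the text)
YEARS_PER_DOUBLING = 3.0    # ITRS transistor doubling period (from the text)


def annual_rate(years_per_doubling: float) -> float:
    """Convert a doubling period into an equivalent per-year growth factor."""
    return 2.0 ** (1.0 / years_per_doubling)


def performance_gap(years: float,
                    historical: float = HISTORICAL_RATE,
                    projected: float = annual_rate(YEARS_PER_DOUBLING)) -> float:
    """Ratio of historical-trend performance to transistor-limited performance
    after `years`, assuming both trends start from the same point."""
    return (historical / projected) ** years


if __name__ == "__main__":
    rate = annual_rate(YEARS_PER_DOUBLING)
    print(f"transistor-limited growth: {rate:.2f}x per year")
    for horizon in (1, 5, 10):
        print(f"gap after {horizon:2d} years: {performance_gap(horizon):.1f}x")
```

Under these assumptions the gap compounds quickly and approaches an order of magnitude within a decade, which is the shortfall that specialized logic is meant to close.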
This dissertation argues that maintaining growth in system
performance requires
using transistors more efficiently to achieve higher performance
per watt. The power
consumption of a computing device depends on all layers of the
design space, from
application software and system architecture to circuits and
process technology.
This work takes a holistic approach by developing models and
designs incorporating
all layers of the design space. In this chapter, we describe the
technology and market conditions that motivate this work and our holistic
approach. We also describe
accelerator-based architectures in general and allude to a
prototype that we designed
and taped-out for this work. Finally, we summarize the main
contributions of this
work.
[Figure 1.1 plot: CPU performance (log scale) versus year, 1985 to 2020, showing the historical 1.58x trend, the power-constrained era, and multi-core and single-thread performance predictions.]
Figure 1.1: Growth in Microprocessor Performance. Historically the industry has observed a total 1.58x performance gain per year. Power consumption constraints inhibit performance growth, causing a gap between expected and delivered performance. Data from Hennessy and Patterson [25] and spec.org [54].
1.1 Motivation
1.1.1 Technology and Trends
Over the past few decades the performance of microprocessors has
grown steadily.
However, over the past several years designers have been forced
to slow the growth
of single-thread performance because of increasing power
consumption. To explore
these trends, Figure 1.1 plots both historical performance
growth and projected multi-core
and single-threaded performance growth until 2020. All data
in the plot is
relative to the VAX 11/780 as measured by SPECint benchmarks.
Data in the plot
prior to 2005 was obtained from Hennessy and Patterson, and
data for recent years
was obtained using the highest single-die performance
SPECint2006 (single-thread)
and SPECint2006rate (multi-core) from the SPEC website [25, 54].
Performance
growth began to deviate from the historical 1.58x per year trend
in 2001, primarily
due to the difficulty of obtaining additional clock frequency
and instruction-level
parallelism improvements in the face of power constraints. The
computing industry
has reacted to this trend by concentrating on multi-core designs
that capture thread-level
parallelism. Unfortunately, as detailed in this work,
power issues will prevent
multi-core performance growth from meeting the historical trend,
and closing this
gap will require more efficient use of transistors.
1.1.2 Market Requirements
The growth of the semiconductor industry has been driven not only by
performance gains but also by a growing diversity of applications for
microprocessors. Microprocessors have moved out of government and
corporate computing centers into homes, schools, coffee shops, and,
now, pockets and pocketbooks. As microprocessors have found additional
uses beyond high-performance and desktop computing, new design
constraints are being applied to microprocessors, among them power,
size, and cost.
Power consumption is increasingly the primary design constraint for
mobile and embedded devices, as designers try to maximize battery life
and reduce cooling cost. The performance and power consumption
requirements across market segments vary by several orders of
magnitude: high-performance servers have a power limit of 200W, while
some processors for laptops and netbooks are designed to consume a
maximum of 1-10 W (Chapter 2 includes a more detailed list of market
segments and power constraints). The power constraints imposed by the
market are at odds with the increase in power density caused by
technology scaling. Because mobile and embedded devices are untethered
from the power grid, power consumption has been a concern within these
communities for some time.
The emerging market segment of wireless sensor networks (WSNs) places
even more stringent power constraints on processor design and therefore
is an indicator of what is to come for the other market segments.
Wireless sensor networks have applications in medicine, science,
industrial automation, and security. WSN nodes are often deeply
embedded in an environment and decoupled from the wired power grid.
Consequently, designers would like to use scavenged energy to power WSN
devices indefinitely. Currently available energy scavenging methods
place a power consumption constraint of roughly 100µW on
microprocessors designed for environmentally powered WSNs (a more
detailed background of WSNs and energy scavenging is presented in
Section 3.1). These strict limits on power consumption increase the
design pressure to maximize performance-per-watt. As technology scales
and power density increases, other market segments will face similar
design challenges.
[Figure 1.2 diagrams: (a) Holistic Approach, which addresses power
consumption at all layers: Application, Network, Architecture,
Circuits, Process Tech. (b) Research Cycle, in which architecture is
informed by modeling and prototyping: Modeling (Power + Performance),
Design (Architecture/Circuits), Circuit Simulations, Prototyping.]
Figure 1.2: Research Approach. We take a holistic approach to research,
understanding and addressing power consumption at all layers of the
design space. Architecture innovations are informed by modeling and
prototyping.
This work investigates the impact of technology scaling on power
consumption. As this section has described, the pressures of a
power-constrained era require designers to think about improving
performance per watt by using transistors more efficiently. This work
takes a holistic approach, looking at all areas of the design space and
using the emerging domain of WSNs as a case study in ultra-low-power
design.
1.2 Holistic Approach
During the course of our research, we have taken the view that
all layers of the
design space influence power consumption, from the application
and network to the
architecture and circuits. Figure 1.2 provides a graphical
description of the research
approach we employed. Our research efforts follow an iterative
approach through
modeling, design, and prototyping, and our models incorporate inputs
from a variety of design layers. For example, the PowerTOSSIM model
(Section 3.1.2) accepts inputs from the network and application layers
and physical power measurements of nodes, while the Navigo model
(Chapter 2) takes data from circuit simulations, process technology
data, and performance benchmarks of different architectures.
We use modeling to guide design decisions, which are verified by
circuit simulations and prototyping. Chapter 3 describes a design
motivated by the modeling of application behavior and addresses leakage
current, which is increasing due to technology scaling. Because our
power consumption targets are so low, we developed a prototype in 130nm
CMOS to verify that our design achieves ultra-low-power operation. Both
the power and performance measurements of the prototype, presented in
Chapter 4, prompt more analysis and modeling of generalized
accelerator-based architectures. Consequently, results from our
prototype and modeling efforts will drive our future research efforts.
1.3 Accelerator-based Architectures
Both the trends in technology and market pressures to increase power
efficiency reveal the need to extract more computation from each
transistor switch. Many designers intuitively believe that
application-specific integrated circuits (ASICs) provide higher
performance and increased energy efficiency over general-purpose
designs. However, ASICs are tuned for a particular set of computations
and hence do not possess the flexibility and programmability of a
general-purpose processor. One approach, used by the system-on-chip
community, places ASIC accelerators on a chip
with a general-purpose microcontroller. As we show in this work, an
accelerator-based approach has the potential to compensate for the loss
of performance due to power constraints. We show that maximizing total
system performance requires that the accelerators provide application
speedup (S) for a large fraction of the workload (f).
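The roles of S and f can be made concrete with the classical form of
Amdahl's law. The sketch below is only an illustration under that
classical form; it is not the enhanced variant this dissertation
develops in Section 2.4.1, and the function name overall_speedup is a
placeholder introduced here.

```python
def overall_speedup(f: float, s: float) -> float:
    """Classical Amdahl's law: a fraction f of the workload runs s times faster.

    Illustrative only; this is not the enhanced model of Section 2.4.1.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    # The unaccelerated part (1 - f) is unchanged; the rest shrinks by s.
    return 1.0 / ((1.0 - f) + f / s)


# A large accelerator speedup helps little when it covers a small fraction.
for f in (0.1, 0.5, 0.9):
    print(f"f={f:.1f}  S=100  overall={overall_speedup(f, 100.0):.2f}")
```

With S = 100, covering 10% of the workload yields only about 1.11x
overall, while covering 90% yields about 9.17x, which is why both S and
f must be large.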
The regular nature of computation and the ultra-low-power requirements
of the WSN application domain make it well-suited to benefit from an
accelerator-based architecture. As a case study of accelerator-based
architectures, we designed and implemented a processor for WSN
applications. Our implementation uses the modular nature of the
architecture to turn off unused accelerators and thereby addresses
leakage current at the architecture level. We also do away with the
notion that the system needs to be controlled by a high-powered
general-purpose core and, instead, replace it with an event-driven
state machine. Traditionally, the energy efficiency of a system has
been evaluated through the metric of energy-per-instruction. The
concept of an instruction does not carry over to accelerator-based
architectures and, therefore, we propose several new methods to analyze
the efficacy of our prototype.
1.4 Summary of Contributions
This work presents the combined contributions of four different
modeling and analysis frameworks and a ground-up silicon implementation
of a processor for wireless sensor networks. Following the research
approach presented in Section 1.2, the modeling frameworks are informed
by several layers of the design space: applications, architecture,
circuits, and process technology. The Navigo model, presented in
Chapter 2, accepts libraries that describe architecture features,
process technology
characteristics, and voltage and frequency relationships from circuit
simulations. Through the analysis of the inputs, Navigo reports an
estimate of performance and power consumption for future generations of
microprocessors. The results reveal that power consumption increasingly
limits performance. Subsequent analysis with Navigo shows that
specialization can provide the necessary performance-per-watt. However,
the high-level analysis from Navigo needed to be grounded in a real
implementation to understand the benefits and costs of
accelerator-based architectures. The design of our prototype was
informed by our modeling efforts of wireless sensor network
applications with PowerTOSSIM, presented in Section 3.1.2, and an
understanding of process technology trends. Likewise, the architecture
of the prototype drives the analysis in Chapter 4 and the process
technology study in Section 3.5. Through the models and prototype, this
work presents the following insights and major contributions.
Navigo: A Model to Study Power-Constrained Architectures and
Specialization (Chapter 2)
Modeling Framework for Early Exploration: Currently, designers use
intuition and spreadsheet-based models to explore design decisions and
estimate power consumption and performance of architectures five to
fifteen years away from tape-out. Navigo provides features not
available in spreadsheet-based models, including voltage-frequency
scaling to meet power constraints and input from circuit simulations.
By incorporating different architecture models, Navigo can be used to
model massive multi-core designs.
Amdahl's Law for Specialization: We enhanced Amdahl's law to model
heterogeneous accelerators that can provide a speedup (S) for a
fraction of applications (f). Including the enhanced Amdahl's law and
architecture models of specialized accelerators, Navigo can be used to
compare homogeneous multi-core designs with designs that include
specialized accelerators.
Results Show Increasing Effect of Power Constraints: Results using
Navigo reveal that performance of multi-core systems will be
significantly reduced due to power constraints. While some designers
intuitively understand this result, our work is one of the first
quantitative presentations of this issue. This result should serve as a
call to action to develop systems with a higher performance-per-watt.
Analysis for Amount of Specialization: By including specialized
accelerators in the model, we use Navigo to select the amount of
specialization (both S and f) required to maintain the historical
performance growth of the semiconductor industry. This analysis gives
designers the target amount of area to allocate to specialization in
designs over the next decade.
Accelerator-Based Architecture for Wireless Sensor Networks
(Chapter 3)
Holistic Design Informed Through Application and Circuits: We built
PowerTOSSIM to study the power consumption of WSN applications. We used
insights gained from PowerTOSSIM to guide our design of the system
architecture.
Accelerator Based Event-Driven Architecture: The custom architecture
for WSNs includes hardware accelerators for regular tasks, offloads
event processing to a custom hardware component (the Event Processor),
and addresses leakage power with architecture support for VDD-gating.
Performance Improvements over Mica2: A SystemC model of the
architecture shows a 10x performance improvement over the Mica2
architecture for typical WSN tasks.
Framework for Process Technology Selection: We built a framework to
evaluate the selection of process technology. We based the framework on
a Verilog model of the architecture and circuit simulations of
different process technology generations. The results show that,
because of increasing leakage current, the most advanced process
technology node is not the best choice to minimize total system power
consumption.
Silicon Implementation and Evaluation of Accelerator Based
Systems (Chapter 4)
Prototype Chip in 130nm CMOS: We built a prototype as a case study of
accelerator-based architectures. It incorporates synthesized
accelerator blocks, a custom VDD-gating circuit, and 2 KB of SRAM for a
total of 444,982 transistors.
Functional Verification and Per Block Power Measurements: We verified
that the prototype functions correctly up to 12.5 MHz at 550 mV.
Post-layout simulations estimate that the system could run up to 100
MHz at 1.2V. Measurements of per-block power show that VDD-gating
reduces idle leakage power by up to 100x.
New Metric of Energy per Task and Comparison to Related Work: The
traditional metric of energy-per-instruction does not accurately
measure an accelerator-based architecture. Therefore, we introduce two
new metrics, Energy per Task and Energy per Equivalent Instruction, to
compare the prototype to related work. With a measured energy per task
of 678.9 pJ and energy per equivalent instruction of 0.44 pJ, this
system is the lowest energy processor currently available for WSNs.
Analysis of Accelerator Speedup and Energy Savings: We isolate the
benefits of accelerator-based computing by comparing hardware and
software implementations of the routines performed by the accelerators.
The results show a 15x to 635x performance speedup and a 10x to 600x
energy savings, depending on the routine.
Comparison to General Purpose Designs through Workload Analysis with
Voltage and Frequency Scaling (VFS): We compare our system against a
general-purpose design while sweeping workload intensity. Voltage and
frequency scaling and VDD-gating are included in the analysis. The
results show that the architecture is well-suited for low duty cycle
applications and at the same time can provide more performance for
high-intensity workloads than general-purpose designs.
This work provides both a high-level justification for
accelerator-based architectures and a case study built from the ground
up. The work concludes with a discussion of some of the open research
questions in this area and a description of current research efforts.
Chapter 2
Navigo: A Model to Study
Power-Constrained Architectures
and Specialization
Contents
2.1 Navigo: A Model for Performance Trends in Future Technologies
  2.1.1 Modeling Methodology and Sample Libraries
2.2 Power-constrained Performance for Multi-core
  2.2.1 Results without Power Constraints
  2.2.2 Results with Power Constraints
2.3 Validating the Model
2.4 Modeling Specialization
  2.4.1 Variant of Amdahl's Law for Specialization
  2.4.2 Examples of Specialized Cores
2.5 Model Limitations and Future Directions
Given the technology scaling trends and market requirements
presented in Sec-
tion 1.1, it is important for chip architects to understand the
limitations of homoge-
neous parallelism and to consider more radical architectural
approaches. This chapter
presents Navigo, a model that incorporates technology scaling
effects to predict future
power-constrained performance trends. Navigo can be used to
predict, for a variety
of processor cores, circuit parameters, and market segments,
performance trends and
shortfalls from the historical growth rate. Future designs that
seek to bridge this gap
must more effectively utilize switching events through
specialized hardware. Special-
ization hardware can take many forms [11, 29, 36, 38] including
programmable SIMD
units, hardcoded ASIC cores, or reconfigurable logic, and Navigo
includes a general
analytical model that can capture the impact of parallel
specialization on power-
constrained performance gains. This model projects the amount of
specialization,
quantified in terms of several parameters, that will be required
in future technology
generations to meet the historical performance scaling
trends.
In addressing the problem of power-constrained performance
scalability, the chap-
ter makes the following contributions:
We describe Navigo (Section 2.1), a high-level model that incorporates
technology scaling, circuit design parameters, and architectural design
decisions to facilitate understanding of power-constrained performance.
We use Navigo to understand a large design space of input
parameters (Sec-
tion 2.2).
We extend Navigo to model parallelizable specialization hardware
(Section 2.4),
introducing additional parameters to quantify specialization
benefits and power/area
costs. This model demonstrates that in order to maintain
historical performance
growth, we must increase the amount of specialization for each
technology gen-
eration.
2.1 Navigo: A Model for Performance Trends in
Future Technologies
Trends in process technology scaling, predicted by the International
Technology Roadmap for Semiconductors (ITRS), reflect a variety of
factors that affect the performance scalability of future computing
systems. Designers can no longer rely on the next technology node to
increase circuit performance and reduce energy consumption.
Constant-field scaling (or Dennard scaling [64]) has run out, with
limits imposed on how aggressively one can reduce transistor threshold
voltages (Vth) and supply voltage (Vdd). The dramatic increase in
leakage current has effectively flattened out Vth scaling for planar
CMOS technologies such that supply voltage scaling has also slowed
down. While technology continues to reduce transistor size, wire
parasitics are getting worse after a short respite gained by moving to
copper. Lastly, the power ceiling imposed by cooling costs and battery
life further limits performance gains traditionally offered by
technology scaling. In short, the landscape of processor design has
changed dramatically since the end of the twentieth century.
It is imperative to arm future designers with tools that can navigate
the complex interactions of future process technology scaling trends
with architectural and circuit design choices, coupled with power
budget limitations imposed across different market segments. To this
end, we present Navigo, a detailed model that incorporates the effects
of process technology, circuits, architecture, and market to predict
future processor performance trends.
[Figure 2.1 diagram: library inputs for Process Technology (ITRS),
Circuits (HSPICE), Architecture (general purpose and specialized
cores), and Market constraints (Server, Desktop, Mobile, WSN, etc.)
feed Navigo, together with user-defined inputs: technology node, Vdd
(nominal, min), frequency, number of cores and type, and market
selection. Outputs: throughput and power.]
Figure 2.1: Graphical depiction of Navigo. The model accepts library
files for process technology, circuits, architecture, and market
segments, and computes total and constrained power for a set of
user-defined inputs such as supply voltage, frequency, etc.
This section begins with a high-level overview of Navigo, which
outlines the basic
goals and assumptions made. Then, it describes the inner details
of the model,
revealing how it can be used by designers in early stages of
design to help guide
high-level system and architectural design decisions.
Navigo provides designers with a powerful and flexible tool to
navigate the in-
tricate tradeoffs between process technology, circuits, and
architecture, in order to
predict their implications on performance in future processor
designs. Figure 2.1
presents a high-level graphical representation of Navigo. The model
takes in a variety of input libraries, which quantify detailed
parameters corresponding to process technology, circuit performance,
architecture, and market segment constraints. While each of these
libraries can be modified by the user, Navigo includes built-in
libraries based on ITRS technology scaling predictions out to 11nm
(anticipated in 2022), predictive technology models (PTM) [47, 67],
IPCs of currently available processor cores (based on SPECint2006
scores), and high-level power and area constraints for different market
segments. With the libraries in place, the designer can sweep a variety
of input parameters such as technology node, voltage, frequency, and
target market. Navigo then outputs the total system throughput and
power. The user can then refine her design by iterating through
different input parameters to meet a specific throughput and/or power
target.
2.1.1 Modeling Methodology and Sample Libraries
An engine that takes the various libraries and input sweep
parameters to calcu-
late throughput and power consumption is at the core of Navigo.
This engine must
consider a variety of factors such as the number and
characteristics of computational
blocks (i.e. cores), voltage and frequency scaling, wire
loading, leakage power, and
process technology, all constrained by power budget limitations.
All of these factors
are quantified by the different library parameters.
The process technology library quantifies several parameters and
characteristics utilized by Navigo, which are listed in Table 2.1.
These parameters set the basic device and wire characteristics that
Navigo uses to determine circuit speed, power, and the number of cores
that will be available in future technology nodes. The built-in process
technology library uses published data from ITRS 2007 [47, 67] out to
the 11nm technology node anticipated in year 2022. ITRS predicts double
gate technology will supplant planar bulk devices at the 32nm node in
year 2013. Because ITRS is a predictive roadmap based on current
projections of technology, the semiconductor industry has a history of
either under- or out-performing it. For example, Intel's technology
roadmap is more aggressive, with processors at the 45nm node already
shipping and plans to introduce processors on the 32nm node in late
2009. Hence, this library can be readily modified by the user to better
reflect updated ITRS projections or proprietary information if
available. Table 2.2 compares technology trends up to 1999 described by
Borkar [2] to ITRS 2007 predictions, which reveals a divergence in
power density. This departure from traditional constant-field scaling
affects frequency and voltage scaling in future designs, which we
thoroughly explore in Section 2.2. Throughout the rest of this chapter,
we rely on technology predictions made by ITRS 2007.

Year of Production                2007    2010    2013    2016    2019    2022
Device type                       Bulk    Bulk    DG      DG      DG      DG
Approximate node (nm)             65      45      32      22      16      11
Supply Voltage (V)                1.1     1.0     0.9     0.8     0.7     0.65
Physical Gate Length (nm)         25      18      13      9       6.3     4.5
Id sat (uA/um)                    1211    1807    2204    2627    2768    2786
Intrinsic delay (ps)              0.64    0.46    0.26    0.15    0.1     0.08
Intrinsic switching energy (fJ)   0.064   0.045   0.020   0.0085  0.0037  0.0020
RC delay of 1mm wire (ps)         890     2100    4555    10652   23515   58525
Die Size-Server (mm2)             310     310     310     310     310     310
Number of Transistors (M)         1106    2212    4424    8848    17696   35391
(Bulk = Planar Bulk, DG = Double Gate)
Table 2.1: Predicted Process Technology Characteristics.
High-Performance Microprocessor Technology, ITRS 2007 Edition [50].

Source                  Transistor  Energy per  Active  Area   Power
                        Delay       Switch      Power          Density
Borkar99 [2]            0.70        0.34        0.49    0.49   1.00
ITRS07 (average) [50]   0.67        0.51        0.76    0.50   1.53
Table 2.2: Technology Scaling Factors. High-Performance Microprocessor
Logic. Indicates a departure from historical scaling trends resulting
in an increase in power density [50].
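The divergence in Table 2.2 can be checked with a few lines of
arithmetic. The sketch below assumes that active power scales as energy
per switch divided by transistor delay, and that power density scales
as active power divided by area. These relations are an interpretation
that reproduces the table's columns to within rounding, not a statement
of how the table was originally computed, and the dictionary and
function names are placeholders.

```python
# Per-generation scaling factors copied from Table 2.2.
SCALING = {
    "Borkar99": {"delay": 0.70, "energy": 0.34, "area": 0.49},
    "ITRS07": {"delay": 0.67, "energy": 0.51, "area": 0.50},
}


def active_power_factor(row: dict) -> float:
    """Assumed relation: active power ~ energy per switch / switching delay."""
    return row["energy"] / row["delay"]


def power_density_factor(row: dict) -> float:
    """Assumed relation: power density ~ active power / area."""
    return active_power_factor(row) / row["area"]


for name, row in SCALING.items():
    print(f"{name}: active power x{active_power_factor(row):.2f}, "
          f"power density x{power_density_factor(row):.2f} per generation")
```

Under these assumptions the Borkar99 factors give a power density
factor close to 1.0 (constant power density), while the ITRS07 factors
give roughly 1.5x per generation, in line with the last column of
Table 2.2.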
The circuits library utilizes predictive technology models (PTM) [47,
67], available from the 45nm node down to 16nm, to model how power and
frequency scale with supply voltage and different amounts of wire
parasitics. In the absence of detailed circuit blocks that can be
simulated, we rely on HSPICE simulations of fanout-of-4 ring
oscillators across the technologies to determine basic frequency,
power, and voltage trends. We combine ITRS predictions with PTM-based
simulations to extrapolate trends at the 11nm node. These trends allow
Navigo to scale voltage and frequency to meet different power budgets.
It is also important to consider the effects of imposing minimum
voltage (VddMIN) constraints, since allowing arbitrary reductions in
supply voltage can lead to a variety of issues, including
six-transistor SRAM cell instability [65] and exacerbation of on-chip
voltage noise. Again, the circuits library can be modified by the user
to model specific blocks if available.
The architecture library contains a collection of processor cores that
the user can choose to tile together in future multi-core systems. The
built-in architecture library consists of three cores currently in
production, listed in Table 2.3. These cores, Intel Xeon (Netburst),
Intel Core2Duo (Core), and Intel Atom, represent high-end server,
desktop, and mobile CPUs. We plan to include analysis for processors
such as Intel's Core i7, as detailed information becomes available.
Parameters for the processors were obtained from publications and SPEC
scores on spec.org for Xeon and Core2Duo. Since official SPEC results
are not available for Atom, we extrapolate based on benchmark
comparisons between Atom and an Athlon with known SPEC scores [57].
While different processors have been implemented with different
technologies, the power, performance, and area of each core are
appropriately scaled by Navigo utilizing the process technology and
circuits trends prescribed by their respective libraries. The user is
not constrained to these cores and can add other user-defined cores to
the architecture library. For example, Section 2.4 explores the impact
of specialized cores.

Processor                  Tech  Die Size  Cores  Vdd   Freq   Power  IPC
                           (nm)  (mm2)            (V)   (GHz)  (W)    (SPEC06/GHz)
Intel Xeon (Tulsa) [18]    65    435       2      1.25  3.4    110    3.72
Intel Core2Duo (Wolfdale)  45    107       2      1.36  3      65     6.82
Intel Atom [12]            45    25        1      1.0   2.0    2.0    2.35
Table 2.3: Example Cores used in analysis. Data collected from
conference and journal publications and datasheets. SPEC2006 results
used to determine IPC are from spec.org.
The market segment library identifies different market segment targets
that constrain total area and maximum power. Table 2.4 lists examples
of different market segments. Throughout the rest of the chapter, we
focus on two particular market segments: server and mobile. The server
market allows for a maximum area of 310mm2 and maximum power of 198W as
defined by ITRS. In contrast, the mobile market allows for a maximum
area of 100mm2 and maximum power of 35W. Again, different market
segments and/or constraints can be easily defined by the user via
changes to the library.

Market                                Max Power (W)  Die Area (mm2)
MPU-CP Cost and Performance           151            140
MPU-HP High Performance               198            310
MPU-PCC Power Cost and Connectivity   3              70
Desktop-95                            95             100
Desktop-65                            65             100
Mobile Standard Voltage               35             100
Mobile Ultra-low Voltage              10             100
Table 2.4: Market Segment Constraints. Die size and Max Power
Consumption for a set of market segments. Values for the first three
markets came from ITRS [50]. The final four market segments are based
on die size and thermal design point of commercially available Intel
Processors.
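As an illustration of what a user-editable market segment library might
look like, the sketch below encodes the rows of Table 2.4 as a small
lookup table. The actual file format of Navigo's libraries is not
described here, so the MarketSegment class, the MARKET_LIBRARY
dictionary, and the within_budget helper are all hypothetical names
chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketSegment:
    """One row of Table 2.4 (hypothetical in-memory representation)."""
    name: str
    max_power_w: float
    die_area_mm2: float


# Values copied from Table 2.4.
MARKET_LIBRARY = {
    seg.name: seg
    for seg in (
        MarketSegment("MPU-CP", 151, 140),
        MarketSegment("MPU-HP", 198, 310),
        MarketSegment("MPU-PCC", 3, 70),
        MarketSegment("Desktop-95", 95, 100),
        MarketSegment("Desktop-65", 65, 100),
        MarketSegment("Mobile Standard Voltage", 35, 100),
        MarketSegment("Mobile Ultra-low Voltage", 10, 100),
    )
}


def within_budget(segment_name: str, power_w: float, area_mm2: float) -> bool:
    """True if a design fits both the power and area limits of a segment."""
    seg = MARKET_LIBRARY[segment_name]
    return power_w <= seg.max_power_w and area_mm2 <= seg.die_area_mm2


# A 65 W, 100 mm2 design fits Desktop-65 but not the standard mobile segment.
print(within_budget("Desktop-65", 65, 100))
print(within_budget("Mobile Standard Voltage", 65, 100))
```

Adding a new market segment in this representation amounts to adding
one more entry with its own power and area limits.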
Finally, Navigo's engine computes total throughput as follows:
\[ \mathrm{Throughput} = N_{cores} \times \mathit{freq}(V_{dd}, tech) \times IPC_{core} \tag{2.1} \]
where the number of cores, Ncores, is defined by the total die size
(for a target market segment) divided by the area of the chosen core,
scaled to the target technology node. The IPC of each core can be
derived from published (or, for new cores, simulated) SPEC benchmark
results and the clock frequency of the core. Operating frequency
depends on both process technology and voltage, and is calculated based
on the original frequency published for the core. First, Navigo
calculates the maximum frequency of the core at nominal voltage in the
new technology. We incorporate both the intrinsic switching delay of
the transistor and effects due to wire delay scaling. We scale logic
and wires independently because the projected trends follow competing
directions and are modeled separately in ITRS.
\[ \mathit{freq}_{VddNom} = \mathit{freq}_{core,basetech} \times \left( \mathit{frac}_{logic} \, \frac{\mathit{freq}_{switch,tech}}{\mathit{freq}_{switch,basetech}} + \mathit{frac}_{wire} \, \frac{\mathit{freq}_{wire,tech}}{\mathit{freq}_{wire,basetech}} \right) \tag{2.2} \]
where basetech is the original technology in which the core was
fabricated. The nominal frequency is then multiplied by PTM-based
scaling factors to calculate voltage-specific frequencies.
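The throughput calculation can be sketched in a few lines. The example
below is an illustration rather than Navigo's implementation. It
assumes that core area halves with each technology generation (the
ITRS07 area factor of 0.50 in Table 2.2), that per-core area is the die
area divided by the core count in Table 2.3, and that the logic and
wire fractions and frequency ratios passed to nominal_frequency are
supplied by the caller. All function names are placeholders.

```python
# Technology nodes in the built-in library, oldest first (Table 2.1).
NODES_NM = [65, 45, 32, 22, 16, 11]
# Assumed per-generation area scaling (Table 2.2, ITRS07 row).
AREA_SCALE_PER_GENERATION = 0.50


def core_count(die_budget_mm2, core_area_mm2, base_nm, target_nm):
    """N_cores: die budget divided by the core area scaled to target_nm."""
    generations = NODES_NM.index(target_nm) - NODES_NM.index(base_nm)
    scaled_area = core_area_mm2 * AREA_SCALE_PER_GENERATION ** generations
    return die_budget_mm2 / scaled_area


def nominal_frequency(freq_base, frac_logic, switch_ratio,
                      frac_wire, wire_ratio):
    """Equation 2.2: logic and wire portions are scaled independently.

    switch_ratio and wire_ratio are the new-to-base frequency ratios.
    """
    return freq_base * (frac_logic * switch_ratio + frac_wire * wire_ratio)


def throughput(n_cores, freq, ipc_core):
    """Equation 2.1: cores x frequency x per-core IPC."""
    return n_cores * freq * ipc_core


# Core2Duo (Table 2.3): a 107 mm2 die holding 2 cores, built at 45 nm.
core2duo_area = 107 / 2
n45 = core_count(310, core2duo_area, base_nm=45, target_nm=45)
n11 = core_count(310, core2duo_area, base_nm=45, target_nm=11)
print(round(n45), round(n11))  # 6 93
```

Under these assumptions the sketch gives about 6 Core2Duo cores at 45nm
and about 93 at 11nm for a 310 mm2 die, consistent with the counts
reported for Figure 2.2(a) in Section 2.2.1.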
Power depends on voltage, operating frequency, and the transistor
switching rate of the architecture. We model average power with the
following expression:
\[ P_{avg} = P_{active} + P_{leak} = \mathit{freq} \times (E_{switch} \times N_{switching} + E_{wire}) + P_{leak} \tag{2.3} \]
Traditionally, power consumption is modeled as a sum of active power
and leakage power. Navigo computes active power as the number of
transistor switches per second multiplied by the energy per switch. We
calculate switching rate (Nswitching) from published frequency and
power numbers. Since energy per switch (Eswitch) is technology
dependent, it scales based on voltage-dependent scaling factors derived
from HSPICE simulations for each technology node. Wires scale
differently from transistors and, hence, are separately accounted for.
We assume leakage power remains a fixed percentage of the total power
consumption at maximum frequency and nominal voltage, which then scales
with respect to different operating voltage levels. In order to
accommodate different power budgets prescribed by different market
segments, Navigo iterates through voltage and frequency settings until
a specific power target is met. When the model encounters a VddMIN
constraint, it scales only frequency to reduce power, at the expense of
energy efficiency.
While Navigo seeks to combine a variety of factors to accurately
predict future performance, it makes several optimistic assumptions.
First, it may not be feasible to fit an integer number of cores into a
predefined area. Hence, we allow for half-size cores with IPC and power
that scale linearly by one half. Although this scenario is infeasible,
large-area cores in near-term technologies (e.g. 45nm) otherwise
introduce quantization effects that make it difficult to observe
consistent trends. This effect becomes significantly less important as
we scale to more advanced technologies. Second, future multi- and
many-core systems will face a variety of challenges to enable
core-to-core communications. Navigo optimistically assumes a perfect
on-chip interconnection network. Lastly, and perhaps most important, we
assume workloads can be fully parallelized to keep all cores running
continuously. Hence, the model is orthogonal to Hill's investigation
that compares single-threaded versus multi-threaded parallelism [27].
One of the main objectives of developing Navigo was to provide a
detailed yet flexible model to help designers predict performance
trends and guide future designs. Moreover, we use Navigo to show that,
even under the optimistic assumption of perfectly parallel workloads
running on highly parallel many-core designs, power constraints will
hamper performance growth and motivate designers to seek out new
solutions beyond simply increasing the number of cores on a die.
2.2 Power-constrained Performance for Multi-core
Navigo can be used to understand power-constrained performance scalability across
technology generations. In this section, we demonstrate the utility of Navigo by
exploring the scalability of three classes of CPU architectures when considering power-
constrained market segments (Table 3) and the impact of the
minimum supply voltage
constraint.
For each of these explorations, we make several assumptions. First, we assume
that area and power are fixed by the market segment. More advanced technology
nodes increase the number of available transistors, leading to a doubling of
available cores per technology generation; however, frequency benefits are
constrained by power limits. If the power budget is exceeded for a given number of
cores and clock frequency, we scale voltage and frequency down together to meet the
power budget. Once the supply voltage reaches the minimum allowed by circuit
constraints, we fall back to linear frequency scaling alone.
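The core-doubling assumption can be sketched in a few lines. The node list matches the technologies studied in this chapter; the starting value of 5.8125 cores is an assumption, chosen because four doublings give 93, in line with the "around 6" Core2Duo cores at 45nm and 93 cores at 11nm reported in Section 2.2.1. The function name is hypothetical.

```python
NODES_NM = [45, 32, 22, 16, 11]


def cores_per_node(cores_at_first_node, nodes=NODES_NM):
    """Double the core count at each successive technology node."""
    return {node: cores_at_first_node * 2 ** i for i, node in enumerate(nodes)}


# ASSUMPTION: 5.8125 cores at 45nm ("around 6"); four doublings give 93.
projection = cores_per_node(5.8125)
for node, cores in projection.items():
    print(f"{node}nm: {cores:g} cores")
```

The fractional starting count is only possible because the model allows partial cores; a whole-core model would have to start from 5 or 6.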
2.2.1 Results without Power Constraints
To understand the impact of power constraints on scaling, we first consider the
scenario where power is not a design constraint. Figure 2.2 illustrates this
scenario with four sub-figures that show outputs of the model when scaled across
technology nodes for a fixed area budget of 310 mm2. The four sub-figures quantify,
across the three core types, the number of cores, clock frequency, total power, and
total chip throughput. Without power constraints, all metrics scale up with
technology. Figure 2.2(a) shows that the number of Core2Duo cores starts at around 6
in the 45nm node (recall that core count is scaled to meet the 310 mm2 budget) and
scales to 93 cores by 11nm. Without power limitations, frequency scaling continues
unabated, surpassing 19.12 GHz for the Xeon core in 11nm, but this comes at the
price of increased power dissipation, exceeding a kilowatt in the worst case.
Figure 2.2(d)
[Figure 2.2 panels: (a) Number of cores, (b) Frequency (GHz), (c) Total power (W),
(d) Throughput; each plotted for Atom, Xeon, and Core 2 across the 45 nm, 32 nm,
22 nm, 16 nm, and 11 nm nodes. Panel (d) also shows ideal 1.58x/year and 1.35x/year
curves (Core2).]
Figure 2.2: Results without power constraints across process technologies.
Results assume nominal voltage for specified technology and MPU-HP market segment
with a die size of 310 mm2.
plots total chip throughput relative to the Core2Duo in the 45nm technology node,
calculated from the increase in core count together with the improvement in
frequency. Throughput grows at a slightly lower rate than the historical growth
rate of 1.58x per year. This shows that if power were not a constraint, performance
growth could be achieved through a combination of traditional frequency scaling
and multi-core design.
2.2.2 Results with Power Constraints
Incorporating power constraints into our analysis gives a more realistic picture of
expected trends in future technologies. We show that market segments that tolerate
higher power density scale better than more constrained market segments. In this
section, we compare the server market segment, which uses the same 310 mm2 die with
a power limit of 198W, and the mobile market segment, which uses a 100 mm2 die with
a power limit of 35W. Figure 2.3 and Figure 2.4 plot the scalability analysis for
the server and mobile market segments, respectively, across the three core types.
Each figure shows the required supply voltage, clock frequency, total power, and
total chip throughput.
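The difference in allowed power density between the two segments follows directly from the budgets quoted above. The short calculation below is only a worked illustration; the dictionary and function names are hypothetical.

```python
# Market-segment budgets quoted in this section: (die area in mm2, power in W).
SEGMENTS = {"server": (310.0, 198.0), "mobile": (100.0, 35.0)}


def power_density_w_per_mm2(area_mm2, power_w):
    """Allowed power per unit die area for a market segment."""
    return power_w / area_mm2


densities = {
    name: power_density_w_per_mm2(area, power)
    for name, (area, power) in SEGMENTS.items()
}
for name, density in densities.items():
    print(f"{name}: {density:.3f} W/mm2")
```

The server budget works out to roughly 0.64 W/mm2 against 0.35 W/mm2 for mobile, so the server segment tolerates close to twice the power density.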
Focusing on the results for the server market segment, we observe several
important trends. For the Intel Xeon design, power is already constrained at the
45nm technology node, and the design must reduce supply voltage from nominal in
order to meet the power goal. When moving to the 32nm node, the Xeon is able to
achieve a small frequency increase by operating at the minimum supply voltage.
Beyond 32nm, the Xeon frequency decreases slightly and then flattens out as the
power budget
[Figure 2.3 panels: (a) Vdd (V), (b) Frequency (GHz), (c) Total power (W),
(d) Throughput; each plotted for Atom, Xeon, and Core 2 across the 45 nm, 32 nm,
22 nm, 16 nm, and 11 nm nodes. Panel (d) also shows ideal 1.58x/year and 1.35x/year
curves (Core2).]
Figure 2.3: Results with power constraints across process technologies - Server.
Results assume nominal voltage for specified technology and MPU-HP market segment
with a die size of 310 mm2 and max power of 198 W.
[Figure 2.4 panels: (a) Vdd (V), (b) Frequency (GHz), (c) Total power (W),
(d) Throughput; each plotted for Atom, Xeon, and Core 2 across the 45 nm, 32 nm,
22 nm, 16 nm, and 11 nm nodes. Panel (d) also shows ideal 1.58x/year and 1.35x/year
curves (Core2).]
Figure 2.4: Results with power constraints across process technologies - Mobile.
Results assume nominal voltage for specified technology and Mobile market segment
with a die size of 100 mm2 and max power of 35 W. Vdd is limited to VddMIN.
is consumed by additional cores. In contrast, the Intel Core2Duo design allows full
frequency scaling until the 22nm technology node, after which scaling is curtailed;
in 11nm, frequency must be throttled when adding more cores. The Intel Atom core is
much more power-efficient and can continue to scale frequency until 11nm, with
additional power headroom. However, Atom starts with a significant performance
disadvantage compared to Core2Duo, and hence by 11nm, the Core2Duo and Atom roughly
converge on total throughput. The best designs (Atom and Core2Duo) improve at a rate
of 1.35x per year, which by 11nm leaves them nearly 6.6x below the 1.58x per year
curve.
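The size of this gap follows from compounding the two growth rates. The sketch below assumes a 12-year span between the 45nm and 11nm nodes; that span is not stated in this section and is used here because it reproduces the roughly 6.6x figure. The function name is hypothetical.

```python
def growth_gap(ideal_rate, actual_rate, years):
    """Factor by which a curve growing at actual_rate per year falls
    behind one growing at ideal_rate per year after the given years."""
    return (ideal_rate / actual_rate) ** years


# ASSUMPTION: 12 years between the 45nm and 11nm nodes (not stated here).
gap = growth_gap(1.58, 1.35, 12)
print(f"gap after 12 years: {gap:.1f}x")
```

Because the ratio of the two rates is only about 1.17 per year, the shortfall looks small in any single generation and becomes large only through compounding.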
The mobile market segment, seen in Figure 2.4, exhibits similar trends, but the
tighter power constraints result in more severe reductions in clock frequency and
slower overall per-year throughput growth. For example, the Core2Duo hits a
frequency cap around 32nm, and frequency stays flat until 16nm, where it dips
slightly. Even the Atom processor hits the power cap at 16nm, after which its
frequency also dips to maintain the power budget.
An important issue that we see repeatedly throughout the above scenarios is that
the minimum Vdd constraint is reached as we seek to fit designs with many cores into
fixed power budgets by reducing voltage and clock frequency. When a design reaches
this constraint, additional power reduction can only be achieved through inefficient
frequency scaling: an essentially linear reduction in clock frequency offsets the
additional cores. Practically speaking, designers may prefer to simply stop scaling
the number of cores in a system at this point. In order to understand this effect,
we have run additional simulations with the constraint removed; the results are
shown in
[Figure 2.5 panels: (a) Vdd (V), (b) Frequency (GHz), (c) Total power (W),
(d) Throughput; each plotted for Atom, Xeon, and Core 2 across the 45 nm, 32 nm,
22 nm, 16 nm, and 11 nm nodes. Panel (d) also shows ideal 1.58x/year and 1.35x/year
curves (Core2).]
Figure 2.5: Results with power constraints across process technologies without
VddMIN constraints - Mobile. Results assume nominal voltage for specified
technology and Mobile market segment with a die size of 100 mm2 and max power of
35 W. Vdd can be reduced without a lower limit.
Figure 2.5. We significantly reduce VDD to meet the power constraints set by the
market, to as low as 0.6V for the Xeon and Core2Duo microarchitectures in advanced
technologies. There is a clear loss in throughput for systems under minimum VDD
constraints (Figure 2.4 (d)) compared to systems without minimum VDD constraints
(Figure 2.5 (d)). For the Atom processor, minimum Vdd is not a severe issue.
For the mobile market segment in the 11nm node, scaling VDD reduces throughput
by 13.4%. However, the minimum voltage constraint reduces the throughput of the
Xeon core by 57.6% for the same target. Even without this constraint, the Xeon still
performs poorly compared to the more power-efficient cores, because running at very
low voltage does not provide ideal performance.
2.3 Validating the Model
This section presents a back-validation of Navigo for microprocessors built from
1996 to 2007. Because of the predictive nature of the model, it is difficult to
validate Navigo's predictions of the power and performance of microprocessors built
using future process technologies. Therefore, we validate Navigo, starting from an
initial data point from 1996, against microprocessors manufactured over the last 10
years. For validation, we seeded the microarchitecture library with the DEC Alpha
21164 microprocessor, introduced in 1996 and manufactured in 350nm technology. We
developed the technology and circuits library based on ITRS data from 1997 to 2007
and circuit simulation results, using SPICE models from industry and PTM. In 1997,
the ITRS committee did not anticipate the growth in power density that started with
the 180nm technology node. Therefore, for each node, we chose the technology model
CPU             Year  Node (nm)  Die Size (mm2)  Throughput  Freq (GHz)  Power (W)
Alpha 21164     1996  350        210             481         0.5         31
Alpha 21164     1997  350        141             649         0.6         40
Alpha 21264     1998  350        314             993         0.6         73
Alpha 21264A    1999  250        210             1267        0.7         85
Pentium III     2000  180        106             1779        1.0         29
Athlon          2001  180        130             2584        1.6         68
Pentium 4       2002  130        146             4195        3.0         81.8
Opteron         2003  130        193             5364        2.2         89
Xeon            2004  130        237             5764        3.6         92
64-bit Xeon     2005  90         81              6505        3.6         110
Core 2 Extreme  2006  65         143             17909       2.93        75
Xeon 3085       2007  65         143             23207       3           65
POWER6          2007  65         341             35071       4.7         180
Table 2.5: Select Microprocessors from 1996 to 2007. Performance data is from
the analysis in Figure 1.1. Power consumption and die size data was acquired from
datasheets and published microprocessor reports.
from the ITRS year closest to the date of introduction. This technique isolates the
error in ITRS predictions from the modeling framework.
We compare predictions from Navigo with microprocessors manufactured between
1996 and 2007, shown in Table 2.5. We calculate throughput from the same Hennessy
and Patterson and SPECint2006 benchmark data used to develop Figure 1.1, described
in Section 1.1.1. We gathered power consumption data from datasheets and online
microprocessor reports. The die sizes of the microprocessors vary widely;
therefore, we compare throughput per unit area and power per unit area.
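The per-area normalization can be illustrated with three rows taken from Table 2.5. This is only a worked example of the arithmetic; the function and variable names are hypothetical, and the power density is expressed in mW/mm2 to match the axis of Figure 2.6 (b).

```python
# (name, year, die size in mm2, throughput, power in W) copied from Table 2.5.
ROWS = [
    ("Alpha 21164", 1996, 210, 481, 31),
    ("Pentium 4", 2002, 146, 4195, 81.8),
    ("POWER6", 2007, 341, 35071, 180),
]


def per_area(die_mm2, throughput, power_w):
    """Return (throughput per mm2, power density in mW/mm2)."""
    return throughput / die_mm2, 1000.0 * power_w / die_mm2


normalized = {name: per_area(die, thr, pw) for name, _, die, thr, pw in ROWS}
for name, (thr_density, pwr_density) in normalized.items():
    print(f"{name}: {thr_density:.2f} throughput/mm2, {pwr_density:.0f} mW/mm2")
```

Dividing by die area removes the effect of chips that are simply larger, so a 341 mm2 part and a 146 mm2 part can be placed on the same trend line.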
Figure 2.6 (a) presents a comparison of throughput per unit area
predicted with
Navigo and the throughput of commercially available
microprocessors. The x-axis
represents both technology node and year of introduction. The
throughput of the
[Figure 2.6 panels: (a) Throughput/Area, (b) Power/Area (mW/mm2); each compares
Navigo with commercial microprocessors across technology nodes (350nm 1997,
250nm 1999, 180nm 2000, 130nm 2002, 90nm 2005, 65nm 2006).]
Figure 2.6: Validation of Navigo using Microprocessors from 1996 to 2007.
Predicted results use the most recent ITRS technology models. The initial core model
is an Alpha 21164 0.5 GHz in 350nm technology introduced in 1996. The data points
representing commercially available systems are also presented in Table 2.5.
initial core, Alpha 21164 0.5 GHz, matches the prediction from Navigo, which
indicates the absence of static offset errors in the model. The throughput predicted
by Navigo aligns well with the results from the benchmarked microprocessors.
Generally, Navigo estimates the upper bound of throughput per unit area. To combat
increasing power consumption, designers of microprocessors in the 65nm node slowed
the scaling of clock frequency and chose to design multi-core processors made of
simpler cores. Navigo overestimates the throughput of multi-core designs because it
assumes that the costs of communication and thread synchronization are zero.
While Navigo predicts a general trend of increased power density, shown in
Figure 2.6 (b), it does not predict the drastic jump in power consumption caused by
changes in microarchitecture, as it assumes a fixed core design.
During the period be-
tween 1997 and 2005, microarchitects aggressively pursued
single-thread performance, resulting in several high-throughput, high-power
designs. The deeply pipelined NetBurst microarchitecture, manufactured in 130nm
(Pentium 4 and Xeon), had notoriously high power consumption. Subsequently, the
industry changed course and introduced more power-efficient multi-core designs. The
power consumption predicted by Navigo matches the initial core, the Alpha 21164 in
350nm. Navigo also aligns well with the multi-core designs in the 65nm node, which
utilize cores that have microarchitectures similar to the Alpha. The model
correctly shows the transition between an earlier era, when constant field scaling
was still possible and power density remained constant (350nm, 250nm, 180nm), and
the current era of increasing power density.
Our back-validation shows that Navigo predicts throughput well and captures
general trends in power consumption. Navigo incorporates a static model of
microarchitecture; thus, for a more accurate prediction of power consumption, users
should include cores in their libraries that best represent their target core
design.
2.4 Modeling Specialization
Consistent progress towards smaller, faster, and more numerous transistors with
each generation of process technology no longer yields the steady growth in
computing performance enjoyed throughout the 20th century. The power ceiling forced
a right-hand turn in single-thread performance, and CPU designers have been racing
to implement multi-core systems ever since. Unfortunately, Navigo predicts that
even for the server market segment, multi-core scaling will only yield a 1.35x/year
performance growth trend. In order to get back onto the 1.58x
growth trend, design-
ers must maximize the efficiency of transistor (and wire)
switching. In other words,
designers must minimize the overheads associated with a general-purpose (GP) CPU.
One obvious direction is to replace general-purpose computing with dedicated,
specialized hardware that offers higher computation per unit area and power, for an
increasing fraction of the machine's workload. IBM's CELL processor is one such
example. It includes 8 SPEs, which are specialized cores used to speed up SIMD
workloads [11]. Similarly, graphics processing units (GPUs) have been used
extensively by programmers to speed up tasks related to video processing and other
SIMD operations. Another example would be to introduce dedicated hardware
specialized for H.264 decoding. In order to understand the potential benefits of
specialization, this section introduces a parallel variant of Amdahl's Law for
specialization. Then, by augmenting Navigo with specialization, we project the
amount of specialization that will be required in future computing systems to
increase system throughput by 1.58x per year.
2.4.1 Variant of Amdahl's Law for Specialization
Amdahl's Law is commonly used to describe the theoretical limitations of
application speedup given constraints on the fraction of the workload that can be
sped up.
\[
\mathrm{Speedup}_{\mathrm{enhanced}}(f, S) = \frac{1}{(1 - f) + \frac{f}{S}} \tag{2.4}
\]
where f is the fraction of the workload that can be enhanced and S is the amount
of speedup possible through enhancements. Amdahl's Law has been adapted to model
symmetric and asymmetric multi-core systems [27], where parallel cores can execute
[Figure 2.7 panels: (a) Calculation framework, (b) Throughput vs f and S, with
axes fraction of workload (f), Speedup (S), and Throughput (normalized).]
Figure 2.7: Speeding up an application with specialized cores. A workload
is split to an additional set of resources, the specialized core. The fraction of
the application that can be executed on the specialized core is f, with a speedup
of S.
all workloads. With specialized cores, we must make a few assumptions in order to
model speedup using Amdahl's Law. First, we assume special-purpose (SP) cores
can only run specific parts of an application (f), while general-purpose cores can
run the entire workload, albeit with lower efficiency. Second, we
optimistically assume
that workloads are arbitrarily parallelizable (also previously
assumed in Navigo). T