Page 1: COMP522 2012 Lecture4 Future

John Mellor-Crummey

Department of Computer Science, Rice University

[email protected]

Microprocessor Trends and Implications for the Future

COMP 522 Lecture 4 30 August 2012

Page 2: COMP522 2012 Lecture4 Future

Context

• Last two classes: from transistors to multithreaded designs
  — multicore chips
  — multiple threads per core
    – simultaneous multithreading
    – fine-grain multithreading

• Today: focus on hardware trends and implications for the future

2

Page 3: COMP522 2012 Lecture4 Future

The Future of Microprocessors

3

Page 4: COMP522 2012 Lecture4 Future

Review: Moore’s Law

• Empirical observation
  — circuit element count doubles every N months (N ~ 18)
    – features shrink, semiconductor dies grow

• Impact: performance has increased 1000x over 20 years
  — microarchitecture advances from additional transistors
  — faster transistor switching time supports higher clock rates
  — energy scaling

4
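As a quick sanity check of these numbers, a minimal sketch (assuming the ~18-month doubling period and the 20-year window quoted above):

# Back-of-the-envelope check of the Moore's Law numbers quoted above.
doubling_period_months = 18
years = 20
doublings = years * 12 / doubling_period_months        # ~13.3 doublings
transistor_growth = 2 ** doublings                      # ~10,000x more transistors
print(f"{doublings:.1f} doublings -> ~{transistor_growth:,.0f}x transistor count")
# Performance grew ~1000x over the same window, so performance per
# transistor actually fell: not every extra device became proportional speedup.
print(f"performance gain per unit of transistor growth: {1000 / transistor_growth:.2f}")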

Page 5: COMP522 2012 Lecture4 Future

Evolution of Microprocessors 1971-2009

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

Page 6: COMP522 2012 Lecture4 Future

20 Years of Exponential Gains

Three key technology drivers

• Transistor speed scaling

• Core microarchitecture techniques

• Cache memory architecture

6

Page 7: COMP522 2012 Lecture4 Future

Transistor Speed Scaling (Dennard)

• Shrink transistor dimensions by 30% (0.7x)
  — area shrinks by ~50%
  — performance increases by ~40%
    – 0.7x delay reduction = 1.4x frequency increase

• Reduce supply voltage by 30% to keep the electric field constant
  — reduces switching energy by ~65%, or power at 1.4x frequency by ~50%

• Result: the same system power supports 2x as many transistors

7
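The percentages above follow from the classical scaling relations; a minimal numerical check, assuming capacitance tracks feature size, switching energy ~ C·V², and dynamic power ~ C·V²·f:

# Dennard scaling check: shrink linear dimensions and supply voltage to 0.7x,
# assuming capacitance tracks feature size, E_switch ~ C*V^2, P_dyn ~ C*V^2*f.
s = 0.7                       # linear shrink per generation
area = s * s                  # ~0.49x -> area shrinks ~50%
freq = 1 / s                  # ~1.43x -> ~40% higher frequency (0.7x delay)
cap, volt = s, s              # capacitance and supply voltage both ~0.7x
energy = cap * volt ** 2      # ~0.34x -> ~65% less energy per switch
power = energy * freq         # ~0.49x -> ~50% less power at the higher frequency
print(f"area {area:.2f}x, freq {freq:.2f}x, energy {energy:.2f}x, power {power:.2f}x")
print(f"2x as many transistors at the same system power: {2 * power:.2f}x")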

Page 8: COMP522 2012 Lecture4 Future

Dennard Scaling Details

• Scaling properties of CMOS circuits
  — R. Dennard et al. Design of ion-implanted MOSFETs with very small physical dimensions. IEEE Journal of Solid-State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.

• Linear scaling of all transistor parameters
  — reduce feature size by a factor of 1/κ (typically 0.7x per generation)

• For a constant die area and design density, power and power density stay constant while frequency can be increased

8

Page 9: COMP522 2012 Lecture4 Future

Core Microarchitecture Improvements

• Improvements
  — pipelining
  — branch prediction
  — out-of-order execution
  — speculation

• Results
  — higher performance
  — higher energy efficiency

9

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77 10.1145/1941487.1941507.

Figure notes (performance measured with SPEC INT 92, 95, and 2000):
  — on-die caches and pipelined architectures were beneficial: significant performance gain without compromising energy
  — deep pipelines delivered the lowest performance increase for the same area and power increase as OOO speculative designs
  — superscalar and OOO execution provided performance benefits at a cost in energy efficiency

Page 10: COMP522 2012 Lecture4 Future

Performance vs. Area

Pollack's rule: performance increases as (transistor count)^0.5, i.e., as the square root of core area

Captures the area, power, and performance tradeoffs of µarch generations

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.
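A small sketch of what Pollack's rule implies for spending area on one big core versus two baseline cores (assuming single-core performance ~ sqrt(area) and power roughly ~ area):

import math

# Pollack's rule sketch: one core's performance grows only as the square root
# of the area spent on it, while power grows roughly with area.
def pollack_perf(relative_area):
    return math.sqrt(relative_area)

big_core = pollack_perf(2.0)           # one core built from 2x the area
two_small = 2 * pollack_perf(1.0)      # two baseline cores, ideal scaling
print(f"one 2x-area core:  ~{big_core:.2f}x performance for ~2x power")
print(f"two 1x-area cores: ~{two_small:.2f}x throughput for ~2x power (if software scales)")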

Page 11: COMP522 2012 Lecture4 Future

Problem: Memory Performance Lags CPU

• Growing disparity between processor speed and DRAM speed
  — DRAM speed improves more slowly because DRAM is optimized for density and cost

11

Figure: DRAM Density and Performance, 1980-2010
• Speed disparity has grown from 10s to 100s of processor cycles per memory access
• Lately flattened out due to the flattening of clock frequency

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77 10.1145/1941487.1941507.
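To see where the "10s to 100s of cycles" comes from, a sketch with illustrative, assumed latencies and clock rates (not figures from the article):

# Illustrative only: DRAM access latency improved slowly while clocks rose,
# so a similar ~60 ns access costs ever more processor cycles.
dram_latency_ns = 60                        # assumed DRAM access latency
for clock_ghz in (0.1, 1.0, 3.0):           # assumed clock rates over the years
    cycles = dram_latency_ns * clock_ghz    # ns * (cycles per ns)
    print(f"{clock_ghz:4.1f} GHz clock -> ~{cycles:.0f} cycles per DRAM access")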

Page 12: COMP522 2012 Lecture4 Future

Cache Memory Architecture

• Enables DRAM to emphasize density and cost over speed

• 2 or 3 levels of cache needed to span growing speed gap

• Caches
  — L1: high bandwidth, low latency → small
  — L2+: optimized for size and speed

12

• Initially, most transistors were devoted to microarchitecture
• Later, larger caches became important to reduce energy

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77 10.1145/1941487.1941507.

Page 13: COMP522 2012 Lecture4 Future

Attribution of Performance Gains

Transistor speed vs. microarchitecture

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

• Almost 100x due to transistor speed alone

• Transistor speed gains leveling off due to numerous challenges

Page 14: COMP522 2012 Lecture4 Future

The Next 20 Years

• Last 20 years: 1000x performance improvement

• Continuing this trajectory: another 30x by 2020!

• Industry expects a 1000x increase by 2030
  — exponential growth is an unforgiving metric!

14
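A sketch of the compound annual growth rates these targets imply, taking 2012 as the baseline year (the factors and dates are from the slide; the arithmetic is mine):

# Compound annual performance growth required by the targets above.
targets = [
    ("1000x over the last 20 years", 1000, 20),
    ("30x by 2020 (8 years out)",      30,  8),
    ("1000x by 2030 (18 years out)", 1000, 18),
]
for label, factor, years in targets:
    annual = factor ** (1 / years)
    print(f"{label}: ~{annual:.2f}x per year, every year")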

Page 15: COMP522 2012 Lecture4 Future

New Technology Scaling Challenges

• Decreased transistor scaling benefits
  — despite continuing miniaturization
    – little performance improvement
    – little reduction in switching energy
    – decreased performance benefit of scaling
  — the transistor is not a perfect switch: leakage current is exponential, and growing integration capacity exacerbates the effect
  — a substantial portion of power consumption is now due to leakage; keeping leakage under control means the threshold voltage cannot be lowered further, which reduces transistor performance

• Flat total energy budget
  — package power and mobile/embedded computing drive energy efficiency requirements

15
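The "leakage current exponential" point reflects the exponential dependence of subthreshold current on threshold voltage; a hedged sketch using a typical ~100 mV/decade subthreshold slope (an assumed, illustrative value):

# Illustrative subthreshold leakage model: current grows ~10x for every
# ~100 mV reduction in threshold voltage (typical subthreshold slope).
slope_mv_per_decade = 100                   # assumed typical value
for vth_reduction_mv in (0, 100, 200, 300):
    leakage_factor = 10 ** (vth_reduction_mv / slope_mv_per_decade)
    print(f"Vth lowered by {vth_reduction_mv:3d} mV -> ~{leakage_factor:5.0f}x leakage current")
# This is why threshold (and hence supply) voltage can no longer be scaled
# down freely: the dynamic power saved comes back as static leakage power.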

Page 16: COMP522 2012 Lecture4 Future

Ongoing Technology Scaling

• Increasing transistor density (in area and volume) and count, through
  — continued feature scaling
  — process innovations
  — packaging innovations

• Need for increasing locality and reduced bandwidth per operation
  — as the performance of microprocessors increases, application data sets continue to grow

16

Page 17: COMP522 2012 Lecture4 Future

Unconstrained Evolution vs. Power

• If we
  — add more cores as transistor count and integration capacity increase
  — operate at the highest frequency the transistors and designs can achieve

• Then power consumption would be prohibitive

• Implications
  — chip architects must limit the number of cores and frequency to keep power reasonable
    – severely limits the performance improvements achievable!

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

Page 18: COMP522 2012 Lecture4 Future

Transistor Integration @ Fixed Power

• Desktop applications
  — power envelope: 65 W; die size: 100 mm²

• Transistor integration capacity at a fixed power envelope
  — analysis for a 45 nm process technology
    – as the number of logic transistors ↑, the size of the cache ↓
  — as the number of logic transistors ↑, power dissipation increases

• Note: analysis assumes the average activity seen today

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

Figure annotations: 16 MB cache, no logic: 10 W · no cache, all logic: 90 W · 6 MB cache, 50 M transistors of logic: 65 W (matches Core 2 Duo)
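A hedged sketch of the trade-off behind the three labeled points, taking the slide's two end points (all cache: ~10 W, all logic: ~90 W) and interpolating linearly in the die-area fraction spent on cache; the real curve is not linear, so this only roughly reproduces the 65 W Core-2-Duo-like point:

# Hedged sketch: split a 100 mm^2, 65 W desktop die between cache and logic.
# End points from the slide: an all-cache die (16 MB) ~10 W, an all-logic die ~90 W.
ALL_CACHE_W, ALL_LOGIC_W, FULL_DIE_CACHE_MB = 10.0, 90.0, 16.0

def die_power_watts(cache_mb):
    f_cache = cache_mb / FULL_DIE_CACHE_MB            # fraction of die used as cache
    return f_cache * ALL_CACHE_W + (1 - f_cache) * ALL_LOGIC_W

for mb in (0, 6, 16):
    print(f"{mb:2d} MB cache -> ~{die_power_watts(mb):.0f} W")
# The 6 MB point lands near, though not exactly at, the 65 W quoted for the
# Core-2-Duo-like configuration; the real power curve is not linear.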

Page 19: COMP522 2012 Lecture4 Future

What about the Future? Crystal ball says ...

• Modest frequency increase per generation: ~15%

• 5% reduction in supply voltage

• 25% reduction of capacitance

• Then ...

• Over the next 10 years, expect transistor counts to follow Moore's Law, but with logic increasing ~3x and cache increasing >10x

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.
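Those per-generation assumptions translate into a per-transistor switching-power factor via P ~ C·V²·f; the arithmetic below is my own illustration of why the extra transistors must go mostly to cache rather than to proportionally more active logic:

# Per-generation switching-power factor implied by the assumptions above:
# capacitance -25%, supply voltage -5%, frequency +15%, with P ~ C * V^2 * f.
cap, volt, freq = 0.75, 0.95, 1.15
power_per_transistor = cap * volt ** 2 * freq         # ~0.78x per generation
print(f"power per transistor: ~{power_per_transistor:.2f}x per generation")
print(f"total power if transistor count doubles: ~{2 * power_per_transistor:.2f}x")
# Doubling active logic every generation would blow the fixed budget, which is
# why the transistor bounty goes mostly into (low-activity) cache.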

Page 20: COMP522 2012 Lecture4 Future

Extrapolated Transistor Count for Fixed Power Envelope

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

Page 21: COMP522 2012 Lecture4 Future

Energy Rules for the Future!

• Finite, fixed energy budget
  — requires a qualitative shift in thinking about architecture and implementation

• New rules
  — energy efficiency is a key metric for designs
  — energy-proportional computing must be the goal for HW & SW
    – with a fixed power budget, ↑ energy efficiency = ↑ performance

21
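The last sub-bullet above is just the identity relating throughput to the power budget and the energy per operation (a restatement for clarity, not a formula from the slide):

\text{performance}\ \left[\tfrac{\text{ops}}{\text{s}}\right]
  \;=\; \frac{P_{\text{budget}}\ [\text{W}]}{E_{\text{op}}\ [\text{J/op}]},
\qquad\text{so for a fixed } P_{\text{budget}},\ \text{performance rises exactly as } E_{\text{op}}\ \text{falls.}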

Page 22: COMP522 2012 Lecture4 Future

Key Challenges Ahead

• Organizing the logic: multiple cores and customization
  — researchers think single-thread performance has leveled off
  — throughput can increase in proportion to the number of cores
  — customization can reduce execution latency
  — together, multiple cores and customization can improve energy efficiency

• Choices for multiple cores

22

Page 23: COMP522 2012 Lecture4 Future

Three Scenarios for a 150M Transistor Chip

23

Hybrid approach

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77 10.1145/1941487.1941507.

Page 24: COMP522 2012 Lecture4 Future

TI System-on-a-chip

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

Page 25: COMP522 2012 Lecture4 Future

Some Design Choices

• Accelerators for specialized tasks
  — graphics
  — media
  — image
  — cryptographic
  — radio
  — digital signal processing
  — FPGA

• Increase energy efficiency by restricting memory access structure and control flexibility
  — SIMD
  — SIMT: GPUs require expressing programs as structured sets of threads

25

Page 26: COMP522 2012 Lecture4 Future

Death of 90/10 Optimization

• Traditional wisdom: invest the maximum transistors in the 90% case
  — use precious transistors to increase single-thread performance that can be applied broadly

• However
  — in the new scaling regime (slow transistor performance, energy efficiency constraints), it makes no sense to keep adding transistors to a single core, as energy efficiency suffers

• Result: the 90/10 rule no longer applies

• Rise of 10x10 optimization
  — attack performance as a set of 10% optimization opportunities
    – optimize with an accelerator for a 10% case, another for a different 10% case, and then another 10% case, and so on ...
  — operate the chip with 10% of transistors active, 90% inactive
    – a different 10% active at each point in time
  — can produce a chip with better overall energy efficiency and performance

26

Page 27: COMP522 2012 Lecture4 Future

On-die Interconnect Delay and Energy (45nm)

• As the energy cost of computation is reduced by voltage scaling, data movement costs start to dominate

• The energy spent moving data will have a critical effect on performance
  — every pJ spent moving data reduces the budget for computation
  — if operands move 1 mm on average (10% of the die size), then at a rate of 0.1 pJ/bit, the 576 Tbits/sec of movement consumes almost 58 W, leaving hardly any energy for computation
  — if 10% of the bits move through the on-chip network between cores, with 10 hops on average, the network alone would consume ~35 W

27
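A check of the data-movement arithmetic quoted above; the traffic rate and per-mm energy are from the slide, while the per-hop energy in the network case is derived by working backwards from the quoted ~35 W:

# Check of the on-die data-movement numbers quoted on the slide (45 nm).
bits_per_sec = 576e12                        # 576 Tb/s of operand traffic
pj_per_bit_mm = 0.1                          # 0.1 pJ per bit per mm, from the slide
avg_distance_mm = 1.0                        # operands move ~1 mm on average
wire_watts = bits_per_sec * pj_per_bit_mm * avg_distance_mm * 1e-12
print(f"operand movement over 1 mm: ~{wire_watts:.0f} W")          # ~58 W

net_bits_per_sec = 0.10 * bits_per_sec       # 10% of bits cross the on-chip network
hops = 10                                    # average hop count, from the slide
net_watts = 35.0                             # network power quoted on the slide
pj_per_bit_hop = net_watts / (net_bits_per_sec * hops) * 1e12
print(f"implied network energy: ~{pj_per_bit_hop:.2f} pJ per bit per hop")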

Page 28: COMP522 2012 Lecture4 Future

Improving Energy Efficiency Through Voltage Scaling

• As the supply voltage is reduced, frequency also drops, but energy efficiency increases
  — operating at the threshold voltage is maximally energy efficient, but it would dramatically reduce single-thread performance: not recommended

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.
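A hedged sketch of why efficiency improves as supply voltage drops: energy per operation falls roughly with Vdd², while frequency falls roughly linearly above threshold. This is a simplified, assumed model, not the article's measured curve:

# Simplified, assumed model of supply-voltage scaling (not the article's data):
# frequency ~ (Vdd - Vth) above threshold, energy per operation ~ Vdd^2.
VTH = 0.3                                    # assumed threshold voltage, volts
VDD_NOM = 1.0                                # assumed nominal supply, volts
for vdd in (1.0, 0.8, 0.6, 0.4):
    freq = (vdd - VTH) / (VDD_NOM - VTH)     # single-thread performance proxy
    ops_per_joule = (VDD_NOM / vdd) ** 2     # efficiency relative to nominal Vdd
    print(f"Vdd={vdd:.1f} V: ~{freq:.2f}x frequency, ~{ops_per_joule:.2f}x ops/joule")
# Efficiency keeps rising toward the threshold voltage, but single-thread
# performance collapses; leakage (ignored here) eventually caps the gain.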

Page 29: COMP522 2012 Lecture4 Future

Heterogeneous Many-core with Variation

Small cores could operate at different design points to trade performance for energy efficiency

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

Page 30: COMP522 2012 Lecture4 Future

Data Movement Challenges, Trends, Directions

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

Page 31: COMP522 2012 Lecture4 Future

Circuits Challenges, Trends, Directions

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

Page 32: COMP522 2012 Lecture4 Future

Software Challenges, Trends, Directions

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

Page 33: COMP522 2012 Lecture4 Future

Logic Organization Challenges, Trends, Directions

Figure credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

Page 34: COMP522 2012 Lecture4 Future

Take Away Points

• Moore’s Law continues, but demands radical changes in architecture and software

• Architectures will go beyond homogeneous parallelism, embrace heterogeneity, and exploit the bounty of transistors to incorporate application-customized hardware

• Software must increase parallelism and exploit heterogeneous and application-customized hardware to deliver performance growth

34

Credit: Shekhar Borkar, Andrew A. Chien, The Future of Microprocessors. Communications of the ACM, Vol. 54 No. 5, Pages 67-77. DOI: 10.1145/1941487.1941507.

Page 35: COMP522 2012 Lecture4 Future

Looking back and looking forward: power, performance, and upheaval

35

Page 36: COMP522 2012 Lecture4 Future

Of Power and Wires

• Physical power and wire delay limits
  — derail the clock-speed bounty of Moore's Law for current and future technologies

• Power is now a first-order constraint on designs for all markets
  — limits clock scaling
  — prevents using all transistors simultaneously
    – Dark silicon and the end of multicore scaling. Esmaeilzadeh et al., ISCA '11

36

Page 37: COMP522 2012 Lecture4 Future

Analyzing Power Consumption

• Quantitative performance analysis is the foundation for computer system design and innovation
  — a lack of detailed information impedes efforts to improve performance

• Goal: apply quantitative analysis to measured power
  — a lack of detailed energy measurements is impairing efforts to reduce the energy consumption of modern workloads

37

Page 38: COMP522 2012 Lecture4 Future

Benchmark Classes

• Native non-scalable
  — single-threaded, compute-intensive C, C++, and Fortran benchmarks from SPEC CPU2006

• Native scalable
  — multithreaded C and C++ benchmarks from PARSEC

• Java non-scalable
  — single- and multithreaded benchmarks that do not scale well, from SPECjvm, DaCapo 06-10-MR2, DaCapo 9.12, and pjbb2005

• Java scalable
  — multithreaded Java benchmarks from DaCapo 9.12 that scale in performance similarly to the native scalable benchmarks

38

Page 39: COMP522 2012 Lecture4 Future

Power is Application Dependent

Each of 61 points represents a benchmark. Power consumption varies from 23-89W. The wide spectrum of power responses points to power saving opportunities in software.

39

Figure credit: Hadi Esmaeilzadeh, Ting Cao, Xi Yang, Stephen M. Blackburn, and Kathryn S. McKinley. 2012. Looking back and looking forward: power, performance, and upheaval. CACM 55, 7 (July 2012), 105-114.

Finding: each workload prefers a different HW configuration for energy efficiency

Page 40: COMP522 2012 Lecture4 Future

Processors Considered

Specifications for 8 processors used in experiments

40

Page 41: COMP522 2012 Lecture4 Future

Power Consumption on Different Processors

Measured power for each processor running 61 benchmarks. Each point represents measured power for one benchmark. The “✗”s are the reported TDP for each processor.

41

Figure credit: Hadi Esmaeilzadeh, Ting Cao, Xi Yang, Stephen M. Blackburn, and Kathryn S. McKinley. 2012. Looking back and looking forward: power, performance, and upheaval. CACM 55, 7 (July 2012), 105-114.

Finding: power is application dependent and does not strongly correlate with TDP

Page 42: COMP522 2012 Lecture4 Future

Power, Performance, & Transistors

• Power/performance trade-offs have changed from the Pentium 4 (130 nm) to the i5 (32 nm).

42

• Power and performance per million transistors. Power per million transistors is consistent across different microarchitectures regardless of the technology node. On average, Intel processors burn around 1 W for every 20 million transistors.

Power/performance trade-off by processor
• Each point is an average of the 4 workloads: (native, Java) x (scalable, non-scalable)

Figure credit: Hadi Esmaeilzadeh, Ting Cao, Xi Yang, Stephen M. Blackburn, and Kathryn S. McKinley. 2012. Looking back and looking forward: power, performance, and upheaval. CACM 55, 7 (July 2012), 105-114.
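A small sketch applying the paper's "~1 W per 20 million transistors" rule of thumb; the transistor counts below are hypothetical round numbers for illustration, not measurements from the paper:

# Applying the paper's rule of thumb: ~1 W per 20 million transistors,
# roughly independent of microarchitecture and technology node.
def predicted_watts(transistors_millions):
    return transistors_millions / 20.0

for mtrans in (100, 500, 1000):              # hypothetical chip sizes, in millions
    print(f"{mtrans:5d} M transistors -> ~{predicted_watts(mtrans):.0f} W predicted")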

Page 43: COMP522 2012 Lecture4 Future

Energy/Performance Pareto Frontiers (45nm)

• Energy/performance optimal designs are application dependent and significantly deviate from the average case

43

Figure credit: Hadi Esmaeilzadeh, Ting Cao, Xi Yang, Stephen M. Blackburn, and Kathryn S. McKinley. 2012. Looking back and looking forward: power, performance, and upheaval. CACM 55, 7 (July 2012), 105-114.

Page 44: COMP522 2012 Lecture4 Future

CMP: Comparing Two Cores to One

44

Impact of doubling the number of cores on performance, power, and energy, averaged over all four workloads.

Figure credit: Hadi Esmaeilzadeh, Ting Cao, Xi Yang, Stephen M. Blackburn, and Kathryn S. McKinley. 2012. Looking back and looking forward: power, performance, and upheaval. CACM 55, 7 (July 2012), 105-114.

Energy impact of doubling the number of cores for each workload. Doubling the cores is not consistently energy efficient among processors or workloads.

Page 45: COMP522 2012 Lecture4 Future

Scalability of Single-threaded Java

Counterintuitively, some single-threaded Java benchmarks scale well: the underlying JVM exploits parallelism for compilation, profiling, and GC

45

Figure credit: Hadi Esmaeilzadeh, Ting Cao, Xi Yang, Stephen M. Blackburn, and Kathryn S. McKinley. Looking back and looking forward: power, performance, and upheaval. CACM 55, 7 (July 2012), 105-114.

Page 46: COMP522 2012 Lecture4 Future

Comparing Microarchitectures

Nehalem vs. four other architectures

In each comparison, the Nehalem is configured to match the other processor as closely as possible

46

Impact of microarchitecture change with respect to performance, power, and energy, averaged over all four workloads.

Energy impact of microarchitecture for each workload. The most recent microarchitecture, Nehalem, is more energy efficient than the others, including the low-power Bonnell (Atom).

Page 47: COMP522 2012 Lecture4 Future

Looking Forward: Findings

• Native sequential workloads don’t approximate managed workloads or even parallel workloads

• Diverse application power profiles suggest that future applications and system software will need to participate in power optimization and management

• Software and hardware researchers need access to real measurements to optimize for power and energy

47