Exploiting Dark Silicon in Server Design Nikos Hardavellas Northwestern University, EECS

Apr 27, 2022

Page 1: Exploiting Dark Silicon in Server Design

Exploiting Dark Silicon in Server Design

Nikos Hardavellas

Northwestern University, EECS

Page 2: Exploiting Dark Silicon in Server Design

© Hardavellas 2

Moore’s Law Is Alive And Well

[Figure: 90nm transistor (Intel, 2005); Swine Flu A/H1N1 virus (CDC)]

[Timeline of technology nodes: 90nm; 65nm (2007); 45nm (2010); 32nm (2013); 22nm (2016); 16nm (2019)]

Device scaling continues for at least another 10 years

Page 3: Exploiting Dark Silicon in Server Design

Good Days Ended Nov. 2002

[Yelick09]

“New” Moore’s Law: 2x cores with every generation

Moore’s Law Is Alive And Well

Page 4: Exploiting Dark Silicon in Server Design

Exponential Growth of Core Counts

So, are 1000-core chips a viable architecture?

[Figure: number of cores per chip (log scale, 1 to 512) by year, 1975 to 2015. Chips plotted: 8086, 80286, 80386, 80486, Pentium, Pentium Pro, P-II, P-III, P-4, Itanium, Power4, MIT-RAW, UltraSPARC IV, Pentium-D, Xeon, Brisbane, Niagara, Core 2 Quad, Turion X2, Core 2 Duo, Denmark, Teraflops, Larrabee, TILE64, Sun Rock, Cell, Xeon, Agena, Barcelona, Toliman, Core i7, Istanbul, Beckton, Magny-Cours]

Page 5: Exploiting Dark Silicon in Server Design

Performance Expectations vs. Reality

[Figure: relative performance (0 to 25) vs. year of technology introduction (2004 to 2019). Series: Speedup under Moore's Law; Speedup under Physical Constraints]

Physical constraints limit speedup

Page 6: Exploiting Dark Silicon in Server Design

Area vs. Power Envelope

Good news: can fit hundreds of cores. Bad news: cannot power them all

[Figure: number of cores (1 to 256) vs. cache size (1 to 512 MB). Curves: Area (310mm²); Power (130W)]

Page 7: Exploiting Dark Silicon in Server Design

Pack More Slower Cores, Cheaper Cache

[Figure: number of cores (1 to 256) vs. cache size (1 to 512 MB). Curves: Area (310mm²); Power (130W) under voltage-frequency scaling (VFS), one curve per operating point: 1 GHz, 0.27V; 2.7 GHz, 0.36V; 4.4 GHz, 0.45V; 5.7 GHz, 0.54V; 6.9 GHz, 0.63V; 8 GHz, 0.72V; 9 GHz, 0.81V]

The reality of The Power Wall: a power-performance trade-off
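The trade-off behind the VFS operating points can be made concrete with the standard first-order dynamic power relation, P ∝ C·V²·f. The sketch below applies it to the (frequency, voltage) pairs listed on the slide. It ignores leakage and assumes equal switched capacitance per core and perfectly parallel work, none of which the slides state; the function names are illustrative.

```python
# Operating points (frequency in GHz, supply voltage in V) from the slide.
OPERATING_POINTS = [
    (1.0, 0.27), (2.7, 0.36), (4.4, 0.45), (5.7, 0.54),
    (6.9, 0.63), (8.0, 0.72), (9.0, 0.81),
]


def relative_dynamic_power(freq_ghz, vdd, ref=(9.0, 0.81)):
    """Dynamic power relative to the reference point, assuming P ~ C * V^2 * f."""
    ref_freq, ref_vdd = ref
    return (vdd / ref_vdd) ** 2 * (freq_ghz / ref_freq)


def cores_in_same_power(freq_ghz, vdd, ref=(9.0, 0.81)):
    """How many cores at this point draw the dynamic power of one reference core."""
    return 1.0 / relative_dynamic_power(freq_ghz, vdd, ref)


if __name__ == "__main__":
    for f, v in OPERATING_POINTS:
        p = relative_dynamic_power(f, v)
        # Aggregate throughput of the slower cores relative to one 9 GHz core,
        # assuming performance proportional to frequency (an assumption).
        throughput = cores_in_same_power(f, v) * f / 9.0
        print(f"{f:4.1f} GHz @ {v:.2f} V: power x{p:.4f}, throughput x{throughput:.2f}")
```

Under these assumptions, a core at 1 GHz and 0.27V draws about 1/81 of the dynamic power of a core at 9 GHz and 0.81V, so a fixed power budget buys roughly 9x the aggregate throughput when the work parallelizes.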

Page 8: Exploiting Dark Silicon in Server Design

Pin Bandwidth Constraint

[Figure: number of cores (1 to 256) vs. cache size (1 to 512 MB). Curves: Area (310mm²); Power (130W) under voltage-frequency scaling (VFS), one curve per operating point: 1 GHz, 0.27V; 2.7 GHz, 0.36V; 4.4 GHz, 0.45V; 5.7 GHz, 0.54V; 6.9 GHz, 0.63V; 8 GHz, 0.72V; 9 GHz, 0.81V; Bandwidth (1 GHz); Bandwidth (2.7 GHz)]

Bandwidth constraint favors fewer + slower cores, more cache

Page 9: Exploiting Dark Silicon in Server Design

Example of Optimization Results

[Figure: performance (0 to 600 GIPS) vs. cache size (1 to 512 MB). Series: Area (max freq); Power (max freq); Bandwidth, VFS; Area+Power, VFS; Area+P+BW, VFS. Annotations: BW: ~2x loss; Power + BW: ~5x loss]

Jointly optimize parameters, subject to constraints, SW trends

Design is first bandwidth-constrained, then power-constrained
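One way to picture "jointly optimize parameters, subject to constraints" is an exhaustive sweep over cache size and operating point that keeps only designs inside the area, power, and bandwidth envelopes. The toy sketch below does that. The 310mm² and 130W envelopes and the VFS points come from the slides; every other constant (core and cache area, power, pin bandwidth, miss traffic) is a hypothetical placeholder, and the 1/sqrt(cache) miss-traffic rule of thumb and the "cores x frequency" performance proxy are simplifying assumptions, not the model used in the talk.

```python
import math

AREA_MM2, POWER_W = 310.0, 130.0          # envelopes taken from the slides
# Everything below is a hypothetical placeholder, not a value from the talk.
CORE_AREA_MM2, CACHE_AREA_MM2_PER_MB = 4.0, 0.5
CORE_POWER_W_AT_MAX, CACHE_POWER_W_PER_MB = 10.0, 0.05
PIN_BANDWIDTH_GBPS = 40.0
MISS_TRAFFIC_GBPS_PER_GHZ = 1.0           # per core, with a 1 MB cache

VFS_POINTS = [(1.0, 0.27), (2.7, 0.36), (4.4, 0.45), (5.7, 0.54),
              (6.9, 0.63), (8.0, 0.72), (9.0, 0.81)]  # (GHz, V) from the slides
F_MAX, V_MAX = VFS_POINTS[-1]


def max_cores(cache_mb, freq, vdd):
    """Largest core count that satisfies all three constraints (0 if none)."""
    by_area = (AREA_MM2 - cache_mb * CACHE_AREA_MM2_PER_MB) / CORE_AREA_MM2
    core_power = CORE_POWER_W_AT_MAX * (vdd / V_MAX) ** 2 * (freq / F_MAX)
    by_power = (POWER_W - cache_mb * CACHE_POWER_W_PER_MB) / core_power
    # Assumed rule of thumb: miss traffic falls as 1/sqrt(cache size).
    traffic_per_core = MISS_TRAFFIC_GBPS_PER_GHZ * freq / math.sqrt(cache_mb)
    by_bandwidth = PIN_BANDWIDTH_GBPS / traffic_per_core
    return max(0, math.floor(min(by_area, by_power, by_bandwidth)))


def best_design():
    """Exhaustively search cache sizes and VFS points for peak throughput."""
    best = None
    for cache_mb in [2 ** k for k in range(10)]:      # 1 MB .. 512 MB
        for freq, vdd in VFS_POINTS:
            cores = max_cores(cache_mb, freq, vdd)
            perf = cores * freq                       # proxy: IPC of 1 per core
            if best is None or perf > best[0]:
                best = (perf, cores, cache_mb, freq, vdd)
    return best


if __name__ == "__main__":
    perf, cores, cache_mb, freq, vdd = best_design()
    print(f"{cores} cores, {cache_mb} MB cache, {freq} GHz @ {vdd} V -> {perf:.0f} (relative)")
```

Changing the placeholder constants shifts which constraint binds first, which is the point of sweeping them jointly rather than one at a time.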

Page 10: Exploiting Dark Silicon in Server Design

Core Counts for Peak-Performance Designs

[Figure: number of cores (log scale, 1 to 1024) vs. year of technology introduction (2004 to 2019). Series: Max EMB Cores; Embedded (EMB); General-Purpose (GPP)]

Designs > 120 cores impractical for general-purpose server apps

B/W and power envelopes + dataset scaling limit core counts

Physical characteristics modeled after:

• UltraSPARC T2 (GPP)

• ARM11 (EMB)

Page 11: Exploiting Dark Silicon in Server Design

Application Dataset Scaling

Application datasets scale faster than Moore’s Law! → Big Caches

[Figure: scaling factor (0 to 20) vs. year of technology introduction (2004 to 2019). Series: OS Dataset Scaling (Myhrvold's Law); Transistor Scaling (Moore's Law); TPC Dataset (Historic)]

Page 12: Exploiting Dark Silicon in Server Design

Pin Bandwidth Scaling

Off-chip bandwidth scales slowly (#pins, off-chip clock) → Big Caches

[Figure: scaling factor (log scale, 1 to 16) vs. year of technology introduction (2004 to 2019). Series: Transistor Scaling (Moore's Law); Pin Bandwidth]

Page 13: Exploiting Dark Silicon in Server Design

Supply Voltage Scaling

Supply voltage scaling is SLOW! → Dark Silicon

[Figure: scaling factor (log scale, 0.5 to 16) vs. year of technology introduction (2004 to 2019). Series: Transistor Scaling (Moore's Law); Supply Voltage (ITRS)]

Page 14: Exploiting Dark Silicon in Server Design

Chip Power Scaling

Chip power does not scale!

[Figure: Watts per chip (0 to 250) vs. year of technology introduction (2004 to 2019). Series: Max Power (air cooling + heatsink); Chip Power (ITRS)]

Page 15: Exploiting Dark Silicon in Server Design

Range of Operational Voltage

[Watanabe et al., ISCA’10]

Shrinking range of operational voltage hampers voltage-freq. scaling

Page 16: Exploiting Dark Silicon in Server Design

Mitigating Bandwidth Limitations: 3D-stacking

[Loh et al., ISCA’08]

Delivers TB/sec of bandwidth; use as large “in-package” cache

[Amcor Tech]

[Philips]

Page 17: Exploiting Dark Silicon in Server Design

Performance Analysis of 3D-Stacked Multicores

[Figure: performance (0 to 800 GIPS) vs. cache size (1 to 512 MB). Series: Area (max freq); Power (max freq); Bandwidth, VFS; Area+Power, VFS; Area+P+BW, VFS]

Chip becomes power-constrained

Page 18: Exploiting Dark Silicon in Server Design

Exponentially Large Die Area Left Unutilized

[Figure: die size (log scale, 64 to 512 mm²) vs. year of technology introduction (2004 to 2019). Series: Max Die Size; DB2-TPCC; DB2-TPCH; Apache; Trendline (exp.)]

Dark Silicon!!! Should we waste it?

Page 19: Exploiting Dark Silicon in Server Design

Example of a Specialized Multicore Chip

[Figure: chip floorplan with specialized blocks: ILP OoO cores, SIMD units, DSP, many-thread cores, reconfigurable logic, crypto, TCP]

Many custom cores on chip; power only the most useful ones

Page 20: Exploiting Dark Silicon in Server Design

Core Specialization

• Existing general designs

OoO for ILP, in-order MT for memory-latency-bound, SIMD for data-parallel, systolic arrays

• Customizable cores

Tensilica Xtensa (custom ISA and datapath, operation fusion)

• Reconfigurable logic

• Generality of implemented operations

Target specific application

Common macro-operations

General ISA

• Trade-offs in performance, power, programmability, generality

Wide range of “heterogeneity” and “specialization” meanings

Page 21: Exploiting Dark Silicon in Server Design

First-Order Core Specialization Model

• 720p HD H.264 encoder (high-definition video encoder)

• Several optimized implementations exist

Commercial ASICs, FPGAs, CMP software

• Wide range of computational motifs

                Frames per sec   Energy per frame (mJ)   Performance gap with ASIC   Energy gap with ASIC
ASIC            30               4
CMP
  IME           0.06             1179                    525x                        707x
  FME           0.08             921                     342x                        468x
  Intra         0.48             137                     63x                         157x
  CABAC         1.82             39                      17x                         261x

[Hameed et al., ASPLOS’10]

Page 22: Exploiting Dark Silicon in Server Design

Performance of Specialized Multicores

[Figure: speedup (log scale, 1 to 64) vs. year of technology introduction (2004 to 2019). Series: Ideal-P + 3D mem; Ideal-P; EMB + 3D mem; EMB; GPP + 3D mem; GPP]

Specialized multicores deliver 2x-12x higher performance

Page 23: Exploiting Dark Silicon in Server Design

Core Counts for Specialized Multicores

[Figure: number of cores (log scale, 1 to 1024) vs. year of technology introduction (2004 to 2019). Series: Max EMB Cores; EMB + 3D mem; EMB; GPP + 3D mem]

Only a few cores need to run at a time; large die area allows many cores

Power constraints? Yield?

Page 24: Exploiting Dark Silicon in Server Design

Taming Power and Bandwidth: Nanophotonics

[Figure label: Optical Interconnect]

Split chip into chiplets, spread in space

Ease cooling and power delivery, high yield; photonics for bandwidth

Page 25: Exploiting Dark Silicon in Server Design

Nanophotonic Components

[Figure labels: Ge-doped; bit stream 110100101]

Rings selectively couple optical energy of a specific wavelength

64 wavelengths DWDM, 3-5μm waveguide pitch, 10Gbps per link

~100 Gbps/μm bandwidth density!!! [Batten et al., HOTI’08]
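The bandwidth-density figure can be sanity-checked from the other numbers on the slide. The sketch below assumes that "10Gbps per link" means 10 Gbps per wavelength and that all 64 wavelengths share one waveguide; both readings are assumptions, and the names are illustrative.

```python
WAVELENGTHS_PER_WAVEGUIDE = 64      # DWDM channels, from the slide
GBPS_PER_WAVELENGTH = 10.0          # assumes "per link" means per wavelength
PITCH_RANGE_UM = (3.0, 5.0)         # waveguide pitch range, from the slide


def bandwidth_density_gbps_per_um(pitch_um):
    """Aggregate data rate of one waveguide divided by the pitch it occupies."""
    return WAVELENGTHS_PER_WAVEGUIDE * GBPS_PER_WAVELENGTH / pitch_um


if __name__ == "__main__":
    for pitch in PITCH_RANGE_UM:
        density = bandwidth_density_gbps_per_um(pitch)
        print(f"{pitch} um pitch: {density:.0f} Gbps/um")
```

With these readings the estimate lands between 128 and about 213 Gbps/μm, the same order of magnitude as the ~100 Gbps/μm quoted.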

Page 26: Exploiting Dark Silicon in Server Design

Technology: Off-chip Channel Material

Material             Optical Loss   Propagation Speed   Pitch (density)
Silicon Waveguide*   0.3 dB/cm      0.286c              20μm
Optical Fiber        0.2 dB/km      0.676c              250μm

* J. Cardenas et al., Optics Express 2009

• Optical fiber is low-loss, high speed

Enables further spreading out chiplets.

BW density was a challenge (fiber pitch size is large)

Fiber: low optical loss, high speed, flexibility eases assembly
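To put the two channel materials side by side, the sketch below computes total optical loss and one-way propagation delay over a 16 cm link, the fiber length quoted later in the deck. The loss and speed figures come from the table; treating loss as linear in length and ignoring coupler loss are simplifying assumptions, and the names are illustrative.

```python
C_CM_PER_NS = 29.9792458   # speed of light in vacuum, cm per nanosecond

# (loss in dB per cm, propagation speed as a fraction of c), from the table
SILICON_WAVEGUIDE = (0.3, 0.286)
OPTICAL_FIBER = (0.2 / 1e5, 0.676)   # 0.2 dB/km converted to dB/cm


def channel_cost(material, length_cm):
    """Return (total loss in dB, one-way propagation delay in ns)."""
    loss_db_per_cm, speed_fraction = material
    loss_db = loss_db_per_cm * length_cm
    delay_ns = length_cm / (speed_fraction * C_CM_PER_NS)
    return loss_db, delay_ns


if __name__ == "__main__":
    for name, material in [("silicon waveguide", SILICON_WAVEGUIDE),
                           ("optical fiber", OPTICAL_FIBER)]:
        loss_db, delay_ns = channel_cost(material, 16.0)
        print(f"{name}: {loss_db:.6f} dB, {delay_ns:.2f} ns over 16 cm")
```

Over 16 cm the waveguide loses about 4.8 dB while the fiber loss is negligible, and the fiber is also the faster channel; the price is the larger pitch listed in the table.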

Page 27: Exploiting Dark Silicon in Server Design

Technology: Dense Off-Chip Coupling

• Dense optical fiber array. [Lee et al., OSA / OFC/NFOEC 2010]

• <1dB loss, 8 Tbps/mm demonstrated.

Tapered couplers solved bandwidth problem, demonstrated Tbps/mm

Page 28: Exploiting Dark Silicon in Server Design

Galaxy Overall Architecture

[Figure: chiplets 0 to 4 linked by optical fiber. Labels: laser source, couplers, optical fiber, electrical cluster, src, dst]

Cross-chiplet assemblies share an optical bus, forming optical crossbars (FlexiShare)

Page 29: Exploiting Dark Silicon in Server Design

Large-Scale Interconnects

200mm² die, 64 routers per chiplet, 9 chiplets, 16cm fiber

Supports > 1K cores!
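The router and chiplet counts imply how many cores each router must serve to reach the stated scale. The arithmetic below is a back-of-the-envelope sketch: reading "> 1K cores" as at least 1024, and assuming cores are spread evenly across routers, are both assumptions not stated on the slide.

```python
import math

ROUTERS_PER_CHIPLET = 64   # from the slide
CHIPLETS = 9               # from the slide
TARGET_CORES = 1024        # "> 1K cores", read here as at least 1024 (assumption)


def total_routers():
    """Routers across all chiplets."""
    return ROUTERS_PER_CHIPLET * CHIPLETS


def min_cores_per_router(target_cores=TARGET_CORES):
    """Smallest whole number of cores per router that reaches the target."""
    return math.ceil(target_cores / total_routers())


if __name__ == "__main__":
    print(total_routers(), "routers;", min_cores_per_router(), "cores per router")
```

Nine chiplets of 64 routers give 576 routers, so under this reading each router would serve at least two cores.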

Page 30: Exploiting Dark Silicon in Server Design

Conclusions

• Physical constraints and software pragmatics limit core counts

…and performance

• Emerging/exotic technologies may solve some problems

3D-memory for bandwidth

Nanophotonics for bandwidth, power, yield

• Need to reduce wasted energy per unit of work

Heterogeneity, only power the few cores needed

• Need to innovate across software/hardware stack

Programmability, tools are a great challenge

• Scaling forces caches to grow exponentially

Address data management both at cache and software

Page 31: Exploiting Dark Silicon in Server Design

Thank You!

Acknowledgements:

Y. Pan, J. Kim, G. Memik, M. Ferdman, B. Falsafi
