
Rethinking the Systems We Design

Onur Mutlu


September 19, 2014

Yale @ 75


Agenda

Principled Computer Architecture/System Design

How We Violate Those Principles Today

Some Solution Approaches

Concluding Remarks


First, Let’s Start With …

The Real Reason We Are Here Today

Yale @ 35


Some Teachings of Yale Patt


Design Principles

• Critical path design

• Bread and Butter design

• Balanced design

from Yale Patt’s EE 382N lecture notes

(Micro)architecture Design Principles

Bread and butter design

Spend time and resources on where it matters (i.e. improving what the machine is designed to do)

Common case vs. uncommon case

Balanced design

Balance instruction/data flow through uarch components

Design to eliminate bottlenecks

Critical path design

Find the path that limits the maximum speed (the critical path) and shorten it

Break a path into multiple cycles?


from my ECE 740 lecture notes
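A minimal sketch of two of these principles, not taken from the lecture notes. `amdahl_speedup` uses Amdahl's law to show why effort on the common case pays off more than effort on the uncommon case, and `pipeline_throughput` shows why an imbalanced design is limited by its slowest component. The function names and all numbers are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def amdahl_speedup(fraction_improved: float, local_speedup: float) -> float:
    """Overall speedup when a fraction of execution time is sped up."""
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be in [0, 1]")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


def pipeline_throughput(stage_rates: Sequence[float]) -> float:
    """A chain of components moves work no faster than its slowest stage."""
    return min(stage_rates)


if __name__ == "__main__":
    # Hypothetical workload: 90% of time in the common case, 10% elsewhere.
    common = amdahl_speedup(0.9, 2.0)      # modest 2x gain on the common case
    uncommon = amdahl_speedup(0.1, 10.0)   # large 10x gain on the uncommon case
    print(f"2x on the common case:    {common:.2f}x overall")
    print(f"10x on the uncommon case: {uncommon:.2f}x overall")

    # Hypothetical component rates (items per cycle): the middle stage
    # is the bottleneck, so speeding up the others changes nothing.
    print(pipeline_throughput([4.0, 1.0, 4.0]))
```

Under these assumed numbers, a modest gain on the common case beats a large gain on the uncommon case, which is the bread-and-butter argument in miniature.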

My Takeaways

Quite reasonable principles

Stated by other principled thinkers in similar or different ways

E.g., Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966

E.g., Gene M. Amdahl, “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” AFIPS Conference, April 1967.

E.g., Butler W. Lampson, “Hints for Computer System Design,” SOSP 1983.

…

Will take the liberty to generalize them in the rest of the talk


The Problem

Systems designed today violate these principles

Some system components individually might not (or might seem not to) violate the principles

But the overall system

Does not spend time or resources where it matters

Is grossly imbalanced

Does not optimize for the critical work/application


A Computing System

Three key components

Computation

Communication

Storage/memory


Today’s Systems

Are overwhelmingly processor centric

Processor is heavily optimized and is considered the master

Many system-level tradeoffs are constrained or dictated by the processor – all data processed in the processor

Data storage units are dumb slaves and are largely unoptimized (except for some that are on the processor die)


Yet …

“It’s the memory, stupid” (Anonymous DEC engineer)

[Figure: normalized execution time (axis 0 to 100) with a 128-entry window, split into non-stall (compute) time and full-window stall time; annotated “L2 Misses”. Data from Runahead Execution [HPCA 2003].]

Yet …

Memory system is the major performance, energy, QoS/predictability and reliability bottleneck in many (most?) workloads

And, it is becoming increasingly so

Increasing hunger for more data and its (fast) analysis

Demand to pack and consolidate more on-chip for efficiency

Memory bandwidth and capacity not scaling as fast as demand

Demand to guarantee SLAs, QoS, user satisfaction

DRAM technology is not scaling well to smaller feature sizes, exacerbating energy, reliability, capacity, bandwidth problems


This Processor-Memory Disparity

Leads to designs that

do not spend time or resources where it matters

are grossly imbalanced

do not optimize for the critical work/application

Processor becomes overly complex and bloated

To tolerate memory related issues

Complex hierarchies are built just to move and store data within the processor

“The forgotten” memory system becomes dumb and inadequate in many aspects


Several Examples

Bulk data copy (and initialization)

DRAM refresh

Memory reliability

Disparity of working memory and persistent storage

Homogeneous memory

Predictable performance and fairness in memory


Today’s Memory: Bulk Data Copy

[Figure: data path between Memory, MC, L3, L2, L1, and CPU.]

1) High latency

2) High bandwidth utilization

3) Cache pollution

4) Unwanted data movement

1046 ns, 3.6 uJ (for a 4 KB page copy via DMA)

Future: RowClone (In-Memory Copy)

[Figure: data path between Memory, MC, L3, L2, L1, and CPU.]

1) Low latency

2) Low bandwidth utilization

3) No cache pollution

4) No unwanted data movement

1046 ns, 3.6 uJ → 90 ns, 0.04 uJ

DRAM Subarray Operation (load one byte)

DRAM array (4 Kbits) → Row Buffer (4 Kbits) → Data Bus (8 bits)

Step 1: Activate row (transfer row)

Step 2: Read (transfer byte onto bus)

RowClone: In-DRAM Row Copy

DRAM array (4 Kbits) → Row Buffer (4 Kbits) → Data Bus (8 bits)

Step 1: Activate row A (transfer row)

Step 2: Activate row B (transfer row)

0.01% area cost
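The two activation steps can be sketched as a toy software model. This is an illustration under stated assumptions, not the hardware mechanism itself: the `Subarray` class, its method names, and the rule that activating a row while the row buffer already holds data overwrites that row are simplifications invented here. `copy_over_bus` is a hypothetical stand-in for a conventional copy so the data-bus traffic of the two approaches can be compared.

```python
class Subarray:
    """Toy DRAM subarray: rows of bits plus one shared row buffer."""

    def __init__(self, num_rows, row_bits=4096):
        self.rows = [[0] * row_bits for _ in range(num_rows)]
        self.row_buffer = None   # None models the precharged state
        self.open_row = None
        self.bus_bits_moved = 0  # bits that crossed the data bus

    def precharge(self):
        self.row_buffer = None
        self.open_row = None

    def activate(self, row):
        if self.row_buffer is None:
            # Precharged: the row buffer latches the row's contents.
            self.row_buffer = list(self.rows[row])
        else:
            # Assumed toy rule: a buffer that already holds data
            # overwrites the cells of the newly activated row.
            self.rows[row] = list(self.row_buffer)
        self.open_row = row

    def read_byte(self, index):
        self.bus_bits_moved += 8
        return self.row_buffer[index * 8:(index + 1) * 8]

    def write_byte(self, index, bits):
        self.bus_bits_moved += 8
        self.row_buffer[index * 8:(index + 1) * 8] = bits
        self.rows[self.open_row][index * 8:(index + 1) * 8] = bits


def rowclone_copy(sub, src, dst):
    """Step 1: activate the source row. Step 2: activate the destination row."""
    sub.precharge()
    sub.activate(src)
    sub.activate(dst)
    sub.precharge()


def copy_over_bus(sub, src, dst):
    """Conventional copy: every byte leaves and re-enters over the data bus."""
    num_bytes = len(sub.rows[src]) // 8
    sub.precharge()
    sub.activate(src)
    data = [sub.read_byte(i) for i in range(num_bytes)]
    sub.precharge()
    sub.activate(dst)
    for i, bits in enumerate(data):
        sub.write_byte(i, bits)
    sub.precharge()


if __name__ == "__main__":
    sub = Subarray(num_rows=4, row_bits=64)
    sub.rows[0] = [1, 0] * 32
    rowclone_copy(sub, src=0, dst=2)
    print("in-subarray copy, bus bits moved:", sub.bus_bits_moved)
    copy_over_bus(sub, src=0, dst=3)
    print("copy over bus, bus bits moved:", sub.bus_bits_moved)
```

In this model the in-subarray copy moves no bits over the data bus, while the conventional copy moves every bit twice.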

RowClone: Latency and Energy Savings

[Figure: normalized savings (axis 0 to 1.2) in latency and energy for Baseline, Intra-Subarray, Inter-Bank, and Inter-Subarray copy; callouts: 11.6x (latency) and 74x (energy).]

Seshadri et al., “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” MICRO 2013.

End-to-End System Design

System stack, top to bottom: Application, Operating System, ISA, Microarchitecture, DRAM (RowClone)

How does the software communicate occurrences of bulk copy/initialization to hardware?

How to maximize latency and energy savings?

How to ensure data coherence?

How to handle data reuse?

RowClone: Overall Performance

[Figure: IPC improvement and energy reduction, in % compared to baseline (axis 0 to 80), for bootup, compile, forkbench, mcached, mysql, and shell.]

RowClone: Multi-Core Performance

[Figure: normalized weighted speedup (axis 0.9 to 1.5) across 50 workloads (4-core), Baseline vs. RowClone.]

Goal: Ultra-Efficient Processing Near Data

[Figure: a chip with four CPU cores, a mini-CPU core, a video core, an imaging core, four GPU (throughput) cores, an LLC, and a memory controller, connected over the memory bus to memory with specialized compute capability in memory.]

Memory similar to a “conventional” accelerator

Enabling Ultra-Efficient Search

▪ What is the right partitioning of computation capability?

▪ What is the right low-cost memory substrate?

▪ What memory technologies are the best enablers?

▪ How do we rethink/ease (visual) search algorithms/applications?

[Figure labels: Cache, Processor Core, Interconnect, Memory, Database, Query vector, Results.]


DRAM Refresh

DRAM capacitor charge leaks over time

The memory controller needs to refresh each row periodically to restore charge

Activate each row every N ms

Typical N = 64 ms

Downsides of refresh

-- Energy consumption: Each refresh consumes energy

-- Performance degradation: DRAM rank/bank unavailable while refreshed

-- QoS/predictability impact: (Long) pause times during refresh

-- Refresh rate limits DRAM capacity scaling
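A back-of-the-envelope sketch of the performance downside listed above. The 64 ms window comes from the slide; the 8192 refresh commands per window and the 350 ns that each command blocks the rank are assumed, DDR3-style illustrative values, not numbers from the talk. `refresh_busy_fraction` is a hypothetical helper name.

```python
def refresh_busy_fraction(refresh_commands, t_rfc_ns, window_ms=64.0):
    """Fraction of each refresh window during which the rank is unavailable."""
    busy_ns = refresh_commands * t_rfc_ns
    window_ns = window_ms * 1e6
    return busy_ns / window_ns


if __name__ == "__main__":
    # Assumed values: 8192 refresh commands per 64 ms, 350 ns per command.
    fraction = refresh_busy_fraction(refresh_commands=8192, t_rfc_ns=350.0)
    print(f"rank unavailable for {fraction:.2%} of the time")
```

Under these assumptions the busy fraction grows with the time each refresh command takes, which is one way to see why the overhead worsens as capacity grows.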


Refresh Overhead: Performance

[Figure: refresh overhead on performance; callouts: 8% and 46%.]

Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

Refresh Overhead: Energy

[Figure: refresh overhead on energy; callouts: 15% and 47%.]

Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

Retention Time Profile of DRAM


RAIDR: Eliminating Unnecessary Refreshes

Observation: Most DRAM rows can be refreshed much less often without losing data [Kim+, EDL’09][Liu+ ISCA’13]

Key idea: Refresh rows containing weak cells more frequently, other rows less frequently

1. Profiling: Profile retention time of all rows

2. Binning: Store rows into bins by retention time in memory controller

Efficient storage with Bloom Filters (only 1.25KB for 32GB memory)

3. Refreshing: Memory controller refreshes rows in different bins at different rates

Results: 8-core, 32GB, SPEC, TPC-C, TPC-H

74.6% refresh reduction @ 1.25KB storage

~16%/20% DRAM dynamic/idle power reduction

~9% performance improvement

Benefits increase with DRAM capacity

Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
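The profiling, binning, and refreshing steps can be sketched in software. This is a minimal sketch under assumptions, not the RAIDR hardware design: the bin boundaries (64/128/256 ms), the filter size, the hash construction, and the names `BloomFilter` and `RetentionBinner` are all hypothetical. The property it illustrates is that a Bloom filter has no false negatives, so a false positive can only cause a row to be refreshed more often than needed, never less often.

```python
import hashlib


class BloomFilter:
    """Compact set membership with false positives but no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class RetentionBinner:
    """Bins rows by profiled retention time (assumed bin edges, in ms)."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.bin_64 = BloomFilter(num_bits, num_hashes)    # weakest rows
        self.bin_128 = BloomFilter(num_bits, num_hashes)   # moderately weak rows

    def profile(self, row, retention_ms):
        """Steps 1 and 2: record a row's measured retention time in a bin."""
        if retention_ms < 64:
            raise ValueError("row cannot retain data for one 64 ms window")
        if retention_ms < 128:
            self.bin_64.add(row)
        elif retention_ms < 256:
            self.bin_128.add(row)
        # Rows with longer retention are not stored; they use the default.

    def refresh_interval_ms(self, row):
        """Step 3: pick the refresh interval for a row."""
        if row in self.bin_64:
            return 64
        if row in self.bin_128:
            return 128
        return 256


if __name__ == "__main__":
    binner = RetentionBinner()
    binner.profile(row=3, retention_ms=70)
    binner.profile(row=7, retention_ms=200)
    binner.profile(row=11, retention_ms=1000)
    for row in (3, 7, 11):
        print(row, binner.refresh_interval_ms(row))
```

Only the weak rows consume filter space, which is why the storage cost can stay small relative to the number of rows.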


The DRAM Scaling Problem

DRAM stores charge in a capacitor (charge-based memory)

Capacitor must be large enough for reliable sensing

Access transistor should be large enough for low leakage and high retention time

Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]

DRAM capacity, cost, and energy/power hard to scale


The DRAM Scaling Problem

DRAM scaling has become a real problem the system should be concerned about

And, maybe embrace


An Example of The Scaling Problem

[Figure: rows of cells on wordlines; the aggressor row's wordline toggles between V_HIGH (opened) and V_LOW (closed), with victim rows adjacent to it.]

Repeatedly opening and closing a row induces disturbance errors in adjacent rows in most real DRAM chips [Kim+ ISCA 2014]

Most DRAM Modules Are at Risk

A company: 86% (37/43) of modules, up to 1.0×10^7 errors

B company: 83% (45/54) of modules, up to 2.7×10^6 errors

C company: 88% (28/32) of modules, up to 3.3×10^5 errors

Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.

[Figure: an x86 CPU running the loop below against a DRAM module; X and Y label the two accessed locations.]

loop:
  mov (X), %eax    # read address X
  mov (Y), %ebx    # read address Y
  clflush (X)      # evict X from the cache so the next read goes to DRAM
  clflush (Y)      # evict Y from the cache as well
  mfence           # order the flushes before the next iteration's reads
  jmp loop

Observed Errors in Real Systems

• In a more controlled environment, we can induce as many as ten million disturbance errors

• Disturbance errors are a serious reliability issue

CPU Architecture             Errors   Access-Rate
Intel Haswell (2013)         22.9K    12.3M/sec
Intel Ivy Bridge (2012)      20.7K    11.7M/sec
Intel Sandy Bridge (2011)    16.1K    11.6M/sec
AMD Piledriver (2012)        59       6.1M/sec

Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.

How Do We Solve The Problem?

Tolerate it: Make DRAM and controllers more intelligent

Just like flash memory and hard disks

Eliminate or minimize it: Replace or (more likely) augment DRAM with a different technology

Embrace it: Design heterogeneous-reliability memories that map error-tolerant data to less reliable portions

…


Exploiting Memory Error Tolerance

[Figure: memory error vulnerability of App/Data A, App/Data B, and App/Data C, each split into vulnerable data and tolerant data.]

Heterogeneous-Reliability Memory

Vulnerable data → Reliable memory: ECC protected, well-tested chips

Tolerant data → Low-cost memory: NoECC or Parity, less-tested chips

On Microsoft’s Web Search application: reduces server hardware cost by 4.7%; achieves single-server availability target of 99.90%
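The placement idea can be sketched as a simple mapping from data regions to memory tiers. This is a hypothetical illustration: the `Region` type, the `place` function, the tier names, and the example regions are invented here, and how a real system decides which data is error-tolerant is outside the scope of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    error_tolerant: bool  # assumed to come from profiling or annotation


def place(regions):
    """Map tolerant data to low-cost memory and vulnerable data to reliable memory."""
    placement = {"reliable": [], "low_cost": []}
    for region in regions:
        tier = "low_cost" if region.error_tolerant else "reliable"
        placement[tier].append(region.name)
    return placement


if __name__ == "__main__":
    # Hypothetical regions of one application.
    regions = [
        Region("pointer-heavy metadata", error_tolerant=False),
        Region("read-only cached results", error_tolerant=True),
    ]
    print(place(regions))
```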


Some Directions for the Future

We need to rethink the entire memory/storage system

Satisfy data-intensive workloads

Fix many DRAM issues (energy, reliability, …)

Enable emerging technologies

Enable a better overall system design

We need to find a better balance between moving data versus moving computation

Minimize system energy and bandwidth

Maximize system performance and efficiency

We need to enable system-level memory/storage QoS

Provide predictable performance

Build controllable and robust systems


Some Solution Principles (So Far)

More data-centric system design

Do not center everything around computation units

Better cooperation across layers of the system

Careful co-design of components and layers: system/arch/device

More flexible interfaces

Better-than-worst-case design

Do not optimize for the worst case

Worst case should not determine the common case

Heterogeneity in design (specialization, asymmetry)

Enables a more efficient design (No one size fits all)


Role of the Architect


A Quote from Another Famous Architect

“architecture […] based upon principle, and not upon precedent”


Concluding Remarks

It is time to design systems to be more balanced, i.e., more memory-centric

It is time to make memory/storage a priority in system design and optimize it & integrate it better into the system

It is time to design systems to be more focused on critical pieces of work

Future systems will/should be data-centric and memory-centric, with appropriate attention to principles


Finally, people are always telling you:

Think outside the box


I prefer: Expand the box

