Rethinking the Systems We Design
Onur Mutlu
September 19, 2014
Yale @ 75
Agenda
Principled Computer Architecture/System Design
How We Violate Those Principles Today
Some Solution Approaches
Concluding Remarks
First, Let’s Start With …
The Real Reason We Are Here Today
Yale @ 35
Some Teachings of Yale Patt
Design Principles
• Critical path design
• Bread and Butter design
• Balanced design
from Yale Patt’s EE 382N lecture notes
(Micro)architecture Design Principles
Bread and butter design
Spend time and resources where it matters (i.e., improving what the machine is designed to do)
Common case vs. uncommon case
Balanced design
Balance instruction/data flow through uarch components
Design to eliminate bottlenecks
Critical path design
Find the path that limits the maximum speed (the longest-delay path) and shorten it
Break a path into multiple cycles?
from my ECE 740 lecture notes
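The "common case vs. uncommon case" principle can be illustrated with Amdahl's law. The sketch below is only an illustration: the function name and the example fractions are assumptions chosen here, not figures from the lecture notes.

```python
def overall_speedup(fraction_improved: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when a fraction of execution time is sped up."""
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be in [0, 1]")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


# Hypothetical numbers: the same 10x improvement applied to a common case
# (90% of execution time) versus an uncommon case (10% of execution time).
common = overall_speedup(0.90, 10.0)
uncommon = overall_speedup(0.10, 10.0)
print(f"common case: {common:.2f}x, uncommon case: {uncommon:.2f}x")
```

Under these assumed fractions, the improvement pays off far more when applied to the common case, which is the point of bread and butter design.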
My Takeaways
Quite reasonable principles
Stated by other principled thinkers in similar or different ways
E.g., Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966
E.g., Gene M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS Conference, April 1967.
E.g., Butler W. Lampson, “Hints for Computer System Design,” SOSP 1983.
…
Will take the liberty to generalize them in the rest of the talk
The Problem
Systems designed today violate these principles
Some system components individually might not (or might seem not to) violate the principles
But the overall system
Does not spend time or resources where it matters
Is grossly imbalanced
Does not optimize for the critical work/application
Agenda
Principled Computer Architecture/System Design
How We Violate Those Principles Today
Some Solution Approaches
Concluding Remarks
A Computing System
Three key components
Computation
Communication
Storage/memory
Today’s Systems
Are overwhelmingly processor centric
Processor is heavily optimized and is considered the master
Many system-level tradeoffs are constrained or dictated by the processor – all data processed in the processor
Data storage units are dumb slaves and are largely unoptimized (except for some that are on the processor die)
Yet …
“It’s the memory, stupid” (Anonymous DEC engineer)
[Figure: normalized execution time with a 128-entry window, split into non-stall (compute) time and full-window stall time caused by L2 misses. Data from Runahead Execution [HPCA 2003]]
Yet …
Memory system is the major performance, energy, QoS/predictability and reliability bottleneck in many (most?) workloads
And, it is becoming increasingly so
Increasing hunger for more data and its (fast) analysis
Demand to pack and consolidate more on-chip for efficiency
Memory bandwidth and capacity not scaling as fast as demand
Demand to guarantee SLAs, QoS, user satisfaction
DRAM technology is not scaling well to smaller feature sizes, exacerbating energy, reliability, capacity, bandwidth problems
This Processor-Memory Disparity
Leads to designs that
do not spend time or resources where it matters
are grossly imbalanced
do not optimize for the critical work/application
Processor becomes overly complex and bloated
To tolerate memory related issues
Complex hierarchies are built just to move and store data within the processor
“The forgotten” memory system becomes dumb and inadequate in many aspects
Several Examples
Bulk data copy (and initialization)
DRAM refresh
Memory reliability
Disparity of working memory and persistent storage
Homogeneous memory
Predictable performance and fairness in memory
Today’s Memory: Bulk Data Copy
[Diagram: copy data path through Memory, MC, L3, L2, L1, and CPU]
1) High latency
2) High bandwidth utilization
3) Cache pollution
4) Unwanted data movement
1046ns, 3.6uJ (for 4KB page copy via DMA)
Future: RowClone (In-Memory Copy)
[Diagram: Memory, MC, L3, L2, L1, and CPU; the copy is performed inside memory]
1) Low latency
2) Low bandwidth utilization
3) No cache pollution
4) No unwanted data movement
1046ns, 3.6uJ reduced to 90ns, 0.04uJ
DRAM Subarray Operation (load one byte)
[Diagram: DRAM array with 4 Kbit rows, row buffer (4 Kbits), and 8-bit data bus]
Step 1: Activate row (transfer row into the row buffer)
Step 2: Read (transfer byte onto bus)
RowClone: In-DRAM Row Copy
[Diagram: DRAM array with 4 Kbit rows, row buffer (4 Kbits), and 8-bit data bus]
Step 1: Activate row A (transfer row into the row buffer)
Step 2: Activate row B (transfer row from the row buffer into row B)
0.01% area cost
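The two-activation copy can be sketched with a toy software model. This is a minimal illustration only: the class `ToySubarray` and its methods are hypothetical names, and the model ignores timing, sense amplifiers, and every other electrical detail of real DRAM.

```python
class ToySubarray:
    """Toy model of one DRAM subarray: rows of bits plus a single row buffer."""

    def __init__(self, num_rows: int, row_bits: int = 4096):
        self.row_bits = row_bits
        self.rows = [[0] * row_bits for _ in range(num_rows)]
        self.row_buffer = None    # nothing latched yet
        self.bus_bits_moved = 0   # bits that crossed the (8-bit) data bus

    def activate(self, row: int) -> None:
        """Step 1 of both operations: latch a whole row into the row buffer."""
        self.row_buffer = list(self.rows[row])

    def read_byte(self, row: int, byte_index: int) -> int:
        """Conventional access: activate the row, then move one byte over the bus."""
        self.activate(row)
        bits = self.row_buffer[byte_index * 8:byte_index * 8 + 8]
        self.bus_bits_moved += 8
        return int("".join(str(b) for b in bits), 2)

    def rowclone_copy(self, src: int, dst: int) -> None:
        """In-subarray copy: activate src, then activate dst while the buffer
        still holds src's data, so the buffer contents land in dst.
        No bits cross the data bus."""
        self.activate(src)
        self.rows[dst] = list(self.row_buffer)


sub = ToySubarray(num_rows=4)
sub.rows[0] = [1, 0] * 2048          # fill row 0 with a recognizable pattern
sub.rowclone_copy(src=0, dst=3)      # copy row 0 to row 3 inside the subarray
bus_bits_after_copy = sub.bus_bits_moved
first_byte = sub.read_byte(3, 0)     # read back one byte the conventional way
```

In this model the copy moves a full 4 Kbit row without touching the data bus, while even a single-byte read has to go over it.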
RowClone: Latency and Energy Savings
[Figure: bar chart of normalized latency and energy savings for Baseline, Intra-Subarray, Inter-Bank, and Inter-Subarray copy; annotated savings: 11.6x (latency), 74x (energy)]
Seshadri et al., “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” MICRO 2013.
End-to-End System Design
[Diagram: system layers, top to bottom: Application, Operating System, ISA, Microarchitecture, DRAM (RowClone)]
How does the software communicate occurrences of bulk copy/initialization to hardware?
How to maximize latency and energy savings?
How to ensure data coherence?
How to handle data reuse?
RowClone: Overall Performance
[Figure: bar chart of IPC improvement and energy reduction (% compared to baseline) for bootup, compile, forkbench, mcached, mysql, and shell]
RowClone: Multi-Core Performance
[Figure: normalized weighted speedup of Baseline vs. RowClone across 50 workloads (4-core)]
Goal: Ultra-Efficient Processing Near Data
[Diagram: a heterogeneous chip with four CPU cores, a mini-CPU core, a video core, an imaging core, four GPU (throughput) cores, an LLC, and a memory controller, connected over the memory bus to memory with specialized compute capability in memory]
Memory similar to a “conventional” accelerator
Enabling Ultra-Efficient Search
▪ What is the right partitioning of computation capability?
▪ What is the right low-cost memory substrate?
▪ What memory technologies are the best enablers?
▪ How do we rethink/ease (visual) search algorithms/applications?
[Diagram: processor core with cache, connected through an interconnect to memory holding the database; a query vector goes in and results come back]
Several Examples
Bulk data copy (and initialization)
DRAM refresh
Memory reliability
Disparity of working memory and persistent storage
Homogeneous memory
Memory QoS and predictable performance
DRAM Refresh
DRAM capacitor charge leaks over time
The memory controller needs to refresh each row periodically to restore charge
Activate each row every N ms
Typical N = 64 ms
Downsides of refresh
-- Energy consumption: Each refresh consumes energy
-- Performance degradation: DRAM rank/bank unavailable while refreshed
-- QoS/predictability impact: (Long) pause times during refresh
-- Refresh rate limits DRAM capacity scaling
Refresh Overhead: Performance
[Figure: performance overhead of refresh; annotated values: 8% and 46%]
Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
Refresh Overhead: Energy
[Figure: energy overhead of refresh; annotated values: 15% and 47%]
Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
Retention Time Profile of DRAM
RAIDR: Eliminating Unnecessary Refreshes
Observation: Most DRAM rows can be refreshed much less often without losing data [Kim+, EDL’09][Liu+ ISCA’13]
Key idea: Refresh rows containing weak cells more frequently, other rows less frequently
1. Profiling: Profile retention time of all rows
2. Binning: Store rows into bins by retention time in memory controller
Efficient storage with Bloom Filters (only 1.25KB for 32GB memory)
3. Refreshing: Memory controller refreshes rows in different bins at different rates
Results: 8-core, 32GB, SPEC, TPC-C, TPC-H
74.6% refresh reduction @ 1.25KB storage
~16%/20% DRAM dynamic/idle power reduction
~9% performance improvement
Benefits increase with DRAM capacity
Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
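The profiling, binning, and refreshing steps can be sketched in software as follows. This is a rough illustration under stated assumptions, not the RAIDR hardware design: the bin boundaries (64 ms, 128 ms, default 256 ms), the filter sizes, the SHA-256-based hashing, and all names are choices made here for the example.

```python
import hashlib
from typing import Dict, Iterator


class BloomFilter:
    """Small Bloom filter over integer row addresses (no false negatives)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: int) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: int) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: int) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


# Assumed bins: rows retaining data for [64, 128) ms are refreshed every 64 ms,
# rows in [128, 256) ms every 128 ms, and all remaining rows every 256 ms.
BIN_INTERVALS_MS = (64, 128)
DEFAULT_INTERVAL_MS = 256


def build_bins(retention_ms: Dict[int, float]) -> Dict[int, BloomFilter]:
    """Step 2 (binning): record weak rows in one Bloom filter per bin."""
    bins = {interval: BloomFilter(num_bits=8192, num_hashes=4)
            for interval in BIN_INTERVALS_MS}
    for row, retention in retention_ms.items():
        if retention < 128:
            bins[64].add(row)
        elif retention < 256:
            bins[128].add(row)
    return bins


def refresh_interval_ms(row: int, bins: Dict[int, BloomFilter]) -> int:
    """Step 3 (refreshing): pick the shortest interval whose bin claims the row."""
    for interval in BIN_INTERVALS_MS:
        if row in bins[interval]:
            return interval
    return DEFAULT_INTERVAL_MS


# Step 1 (profiling) is assumed to have produced this hypothetical profile.
profile = {0: 70.0, 1: 150.0, 2: 1000.0}
bins = build_bins(profile)
intervals = {row: refresh_interval_ms(row, bins) for row in profile}
```

A Bloom filter can report a row as present when it is not, but never the reverse. In this sketch a false positive only means a row is refreshed more often than it needs to be, so the compact storage does not put data at risk.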
Several Examples
Bulk data copy (and initialization)
DRAM refresh
Memory reliability
Disparity of working memory and persistent storage
Homogeneous memory
Memory QoS and predictable performance
The DRAM Scaling Problem
DRAM stores charge in a capacitor (charge-based memory)
Capacitor must be large enough for reliable sensing
Access transistor should be large enough for low leakage and high retention time
Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]
DRAM capacity, cost, and energy/power hard to scale
The DRAM Scaling Problem
DRAM scaling has become a real problem the system should be concerned about
And, maybe embrace
An Example of The Scaling Problem
[Diagram: rows of DRAM cells; the aggressor row's wordline toggles between VLOW (closed) and VHIGH (opened), with victim rows adjacent to it]
Repeatedly opening and closing a row induces disturbance errors in adjacent rows in most real DRAM chips [Kim+ ISCA 2014]
Most DRAM Modules Are at Risk
Manufacturer   Modules at risk   Errors
A company      86% (37/43)       Up to 1.0×10^7 errors
B company      83% (45/54)       Up to 2.7×10^6 errors
C company      88% (28/32)       Up to 3.3×10^5 errors
Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.
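The per-manufacturer percentages can be re-derived from the raw module counts with a few lines of arithmetic. The combined figure computed at the end is a derived value, not one stated on the slide.

```python
# (modules with errors, modules tested) per manufacturer, from the slide
modules = {"A": (37, 43), "B": (45, 54), "C": (28, 32)}

# Percentage of at-risk modules per manufacturer, rounded to a whole percent
per_vendor = {name: round(100 * bad / total)
              for name, (bad, total) in modules.items()}

# Combined counts across all three manufacturers (derived, not from the slide)
total_bad = sum(bad for bad, _ in modules.values())
total_tested = sum(total for _, total in modules.values())
overall_percent = round(100 * total_bad / total_tested)

print(per_vendor, f"{total_bad}/{total_tested} = {overall_percent}%")
```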
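[Diagram: an x86 CPU running the loop below against two addresses, X and Y, in a DRAM module]
loop:
  mov (X), %eax    # read address X
  mov (Y), %ebx    # read address Y
  clflush (X)      # flush X from the cache so the next read goes to DRAM
  clflush (Y)      # flush Y from the cache so the next read goes to DRAM
  mfence           # wait for the flushes to complete
  jmp loop         # repeat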
Observed Errors in Real Systems
• In a more controlled environment, we can induce as many as ten million disturbance errors
• Disturbance errors are a serious reliability issue
CPU Architecture            Errors   Access-Rate
Intel Haswell (2013)        22.9K    12.3M/sec
Intel Ivy Bridge (2012)     20.7K    11.7M/sec
Intel Sandy Bridge (2011)   16.1K    11.6M/sec
AMD Piledriver (2012)       59       6.1M/sec
Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.
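To get a feel for these access rates, the short calculation below estimates how many accesses fit into one refresh window, using the typical 64 ms refresh interval given on the DRAM Refresh slide. It is a back-of-the-envelope sketch: it assumes the reported rate is sustained for the whole window and makes no claim about how the accesses divide between X and Y.

```python
REFRESH_WINDOW_MS = 64  # typical refresh interval N from the DRAM Refresh slide

# Access rates from the table, in accesses per second
access_rate_per_sec = {
    "Intel Haswell (2013)": 12_300_000,
    "Intel Ivy Bridge (2012)": 11_700_000,
    "Intel Sandy Bridge (2011)": 11_600_000,
    "AMD Piledriver (2012)": 6_100_000,
}

# Accesses that fit into one refresh window (integer arithmetic, exact here)
accesses_per_window = {
    cpu: rate * REFRESH_WINDOW_MS // 1000
    for cpu, rate in access_rate_per_sec.items()
}

for cpu, count in accesses_per_window.items():
    print(f"{cpu}: {count:,} accesses per {REFRESH_WINDOW_MS} ms")
```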
How Do We Solve The Problem?
Tolerate it: Make DRAM and controllers more intelligent
Just like flash memory and hard disks
Eliminate or minimize it: Replace or (more likely) augment DRAM with a different technology
Embrace it: Design heterogeneous-reliability memories that map error-tolerant data to less reliable portions
…
Exploiting Memory Error Tolerance
[Figure: memory error vulnerability of App/Data A, App/Data B, and App/Data C, each split into vulnerable data and tolerant data]
Heterogeneous-Reliability Memory
Reliable memory (for vulnerable data)
• ECC protected
• Well-tested chips
Low-cost memory (for tolerant data)
• NoECC or Parity
• Less-tested chips
On Microsoft’s Web Search application
Reduces server hardware cost by 4.7%
Achieves single server availability target of 99.90%
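The mapping of data to memory tiers can be sketched as a simple placement policy. Everything here is hypothetical: the `DataRegion` type, the numeric vulnerability score, the 0.5 threshold, and the region names are illustrative assumptions, not part of the evaluated system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataRegion:
    name: str
    vulnerability: float  # hypothetical score in [0, 1]; higher = less error-tolerant


def place(regions: List[DataRegion], threshold: float = 0.5) -> Dict[str, List[str]]:
    """Send vulnerable data to reliable memory and tolerant data to low-cost memory."""
    placement: Dict[str, List[str]] = {"reliable": [], "low_cost": []}
    for region in regions:
        tier = "reliable" if region.vulnerability >= threshold else "low_cost"
        placement[tier].append(region.name)
    return placement


# Hypothetical regions of one application
regions = [
    DataRegion("index metadata", 0.9),
    DataRegion("cached results", 0.2),
    DataRegion("scratch buffers", 0.1),
]
placement = place(regions)
```

A real system would have to measure or estimate vulnerability per application and data region; the fixed threshold is just the simplest possible stand-in for that decision.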
Several Examples
Bulk data copy (and initialization)
DRAM refresh
DRAM reliability
Disparity of working memory and persistent storage
Homogeneous memory
Memory QoS and predictable performance
Agenda
Principled Computer Architecture/System Design
How We Violate Those Principles Today
Some Solution Approaches
Concluding Remarks
Some Directions for the Future
We need to rethink the entire memory/storage system
Satisfy data-intensive workloads
Fix many DRAM issues (energy, reliability, …)
Enable emerging technologies
Enable a better overall system design
We need to find a better balance between moving data versus moving computation
Minimize system energy and bandwidth
Maximize system performance and efficiency
We need to enable system-level memory/storage QoS
Provide predictable performance
Build controllable and robust systems
Some Solution Principles (So Far)
More data-centric system design
Do not center everything around computation units
Better cooperation across layers of the system
Careful co-design of components and layers: system/arch/device
More flexible interfaces
Better-than-worst-case design
Do not optimize for the worst case
Worst case should not determine the common case
Heterogeneity in design (specialization, asymmetry)
Enables a more efficient design (No one size fits all)
Agenda
Principled Computer Architecture/System Design
How We Violate Those Principles Today
Some Solution Approaches
Concluding Remarks
Role of the Architect
from Yale Patt’s EE 382N lecture notes
A Quote from Another Famous Architect
“architecture […] based upon principle, and not upon precedent”
Concluding Remarks
It is time to design systems to be more balanced, i.e., more memory-centric
It is time to make memory/storage a priority in system design and optimize it & integrate it better into the system
It is time to design systems to be more focused on critical pieces of work
Future systems will/should be data-centric and memory-centric, with appropriate attention to principles
Finally, people are always telling you:
Think outside the box
from Yale Patt’s EE 382N lecture notes
I prefer: Expand the box
from Yale Patt’s EE 382N lecture notes