Page 1: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

CS 152 Computer Architecture and Engineering
Lecture 14 - Cache Design and Coherence
2014-3-6, John Lazzaro (not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
UC Regents Spring 2014 © UCB

Play:

Page 2: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Today: Shared Cache Design and Coherence

Crossbars and Rings: How to do on-chip sharing.

Concurrent requests: Interfaces that don’t stall.

CPU multi-threading: Keeps memory system busy.

Coherency Protocols: Building coherent caches.

[Diagram: multiple CPUs with private caches connect through shared ports to shared caches, DRAM, and I/O.]

Page 3: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Multithreading

Sun Microsystems Niagara series

Page 4: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


The case for multithreading

Some applications spend their lives waiting for memory. (C = compute, M = waiting.) Amdahl’s Law tells us that optimizing C is the wrong thing to do ...

Idea: Create a design that can multiplex threads onto one pipeline. Goal: Maximize throughput of a large number of threads.

Page 5: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Multi-threading: Assuming perfect caches

Acts like 4 CPUs, each running at 1/4 the clock rate (S. Cray, 1962). Labels T1-T4 show which thread occupies each pipeline stage.

Page 6: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


[Pipeline diagram: ID (Decode), EX, MEM, and WB stages, with IR registers between stages and WE/MemToReg control coming from WB.]

Bypass network is no longer needed ... Result: the critical path shortens, which can be traded for speed or power.

Page 7: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Multi-threading: Supporting cache misses

A thread scheduler keeps track of information about all threads that share the pipeline. When a thread experiences a cache miss, it is taken off the pipeline during the miss-penalty period, as sketched below.
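Below is a minimal software sketch of that idea: a round-robin scheduler that skips threads waiting on misses. The thread count, field names, and the miss-penalty bookkeeping are illustrative assumptions, not the actual hardware design.

```c
/* Sketch of a miss-aware thread scheduler (illustrative only). */
#include <stdbool.h>

#define NUM_THREADS 8

typedef struct {
    bool waiting_on_miss;   /* true while the miss penalty is outstanding */
    int  miss_cycles_left;  /* remaining miss-penalty cycles              */
} thread_state_t;

static thread_state_t threads[NUM_THREADS];
static int last_issued = 0;

/* Called once per cycle: pick the next ready thread, or -1 if none. */
int select_thread(void) {
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (last_issued + i) % NUM_THREADS;
        if (!threads[t].waiting_on_miss) {
            last_issued = t;
            return t;           /* this thread occupies the pipeline slot */
        }
    }
    return -1;                  /* all threads stalled on misses: bubble  */
}

/* Called when thread t takes a cache miss: remove it from scheduling. */
void report_miss(int t, int miss_penalty) {
    threads[t].waiting_on_miss  = true;
    threads[t].miss_cycles_left = miss_penalty;
}

/* Called once per cycle to age outstanding misses. */
void tick(void) {
    for (int t = 0; t < NUM_THREADS; t++)
        if (threads[t].waiting_on_miss && --threads[t].miss_cycles_left == 0)
            threads[t].waiting_on_miss = false;
}
```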

Page 8: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II: # threads/core?

8 threads/core: enough to keep one core busy, given clock speed, memory system latency, and target application characteristics.
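A rough back-of-the-envelope, with assumed numbers (they are not from the slide): if a thread computes for about $t_c$ cycles between misses and then waits about $t_m$ cycles for memory, keeping one core busy needs roughly

$$N_{\text{threads}} \;\approx\; 1 + \frac{t_m}{t_c} \;\approx\; 1 + \frac{200\ \text{cycles}}{30\ \text{cycles}} \;\approx\; 8.$$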

Page 9: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Crossbar Networks

Page 10: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Shared-memory

[Diagram: CPUs with private caches connect through shared ports to shared caches, DRAM, and I/O.]

CPUs share the lower level of the memory system, and I/O. Common address space, one operating system image. Communication occurs through the memory system (~100 ns latency, 20 GB/s bandwidth).

Page 11: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Sun’s Niagara II: Single-chip implementation ...

SPC == SPARC Core. Only DRAM is not on chip.

Page 12: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Crossbar: Like N ports on an N-register file

[Register-file schematic: registers R0 (the constant 0) through R31, clocked by clk, feed two 32-bit read-port muxes (rd1, rd2) selected by 5-bit sel(rs1) and sel(rs2); a demux selected by 5-bit sel(ws), qualified by WE, routes the write enable for the 32-bit write data wd.]

Flexible, but ... reads slow down as O(N²). Why? The number of loads on each Q output grows as O(N), and the wire length to the port mux grows as O(N).
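As a rough scaling argument (a simplification of the real RC delay, not a claim from the slide), the two linear factors multiply:

$$t_{\text{read}} \;\propto\; \underbrace{O(N)}_{\text{loads on each } Q} \times \underbrace{O(N)}_{\text{wire length to port mux}} \;=\; O(N^2).$$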

Page 13: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Design challenge: High-performance crossbar

Niagara II: 8 cores, 8 L2 banks, 4 DRAM channels. Apps are locality-poor. Goal: saturate DRAM bandwidth.

Each DRAM channel: 50 GB/s read, 25 GB/s write bandwidth. Crossbar bandwidth: 270 GB/s total (read + write).

Page 14: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II 8 x 9 Crossbar

Every cross of blue and purple is a tri-state buffer with a unique control signal: 72 control signals (if distributed unencoded). A tri-state distributed mux, as in the microcode talk.

Page 15: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II 8 x 9 Crossbar

8 ports on the CPU side (one per core); 8 ports for L2 banks, plus one for I/O. 4-cycle latency (715 ps/cycle): cycles 1-3 are for arbitration, and data transmits on cycle 4. 100-200 wires per port (each way). Pipelined.

Page 16: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


A complete switch transfer (4 epochs)

Epoch 1: All input ports (that are ready to send data) request an output port.

Epoch 2: Allocation algorithm decides which inputs get to write.

Epoch 3: Allocation system informs the winning inputs and outputs.

Epoch 4: Actual data transfer takes place.

Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, each for a different set of requests.

Page 17: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Epoch 3: The Allocation Problem (4 x 4)

Request matrix (rows: input ports A-D; columns: output ports W, X, Y, Z). A 1 codes that an input has data ready to send to an output.

     W  X  Y  Z
  A  0  0  1  0
  B  1  0  0  0
  C  0  0  1  0
  D  1  0  0  0

The allocator returns a matrix with at most one 1 in each row and column, to set the switches:

     W  X  Y  Z
  A  0  0  1  0
  B  0  0  0  0
  C  0  0  0  0
  D  1  0  0  0

The algorithm should be “fair”, so no port always loses ... and should also “scale” to run large matrices fast. A minimal allocator sketch follows.
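Below is a minimal sketch of one way to compute such a grant matrix: a greedy pass with a rotating starting point for rough fairness. It illustrates the problem statement only; the real Niagara II arbiter is not described in this lecture, and all names here are my own.

```c
/* Toy allocator: grant at most one output per input and one input per
 * output; rotate which port gets first pick each round for fairness. */
#include <stdio.h>
#include <stdbool.h>

#define N 4

void allocate(const bool req[N][N], bool grant[N][N], int round) {
    bool out_taken[N] = { false };
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            grant[i][j] = false;

    for (int k = 0; k < N; k++) {
        int in = (k + round) % N;            /* rotating input priority  */
        for (int m = 0; m < N; m++) {
            int out = (m + round) % N;       /* rotating output priority */
            if (req[in][out] && !out_taken[out]) {
                grant[in][out] = true;       /* at most one 1 per row    */
                out_taken[out] = true;       /* at most one 1 per column */
                break;
            }
        }
    }
}

int main(void) {
    /* Request matrix from the slide: rows A-D, columns W-Z. */
    bool req[N][N] = {
        {0,0,1,0},   /* A -> Y */
        {1,0,0,0},   /* B -> W */
        {0,0,1,0},   /* C -> Y */
        {1,0,0,0},   /* D -> W */
    };
    bool grant[N][N];
    allocate(req, grant, 0);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%d ", grant[i][j]);
        printf("\n");
    }
    return 0;
}
```

With the request matrix above and round 0, this sketch grants A→Y and B→W; the slide's example grant (A→Y, D→W) is an equally valid answer, which is exactly why fairness across rounds matters.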

Page 18: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II Crossbar Notes

Crossbar defines the floorplan: all port devices should be equidistant to the crossbar, giving uniform latency between all port pairs. Low latency: 4 cycles (less than 3 ns).

Page 19: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Page 20: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II Energy Facts: the crossbar is only 1% of total power.

Page 21: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II Crossbar Notes

Crossbar defines the floorplan: all port devices should be equidistant to the crossbar, giving uniform latency between all port pairs. Low latency: 4 cycles (less than 3 ns).

It did not scale up for the 16-core Rainbow Falls. Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port between two cores.

Design alternatives to the crossbar?

Page 22: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


CLOS Networks: From telecom world ...

Build a high-port-count switch by tiling fixed-size shuffle units. Pipeline registers naturally fit between tiles. Trades scalability for latency.

Page 23: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


CLOS Networks: An example route

Numbers on left and right are port numbers. Colors show routing paths for an exchange. Arbitration still needed to prevent blocking.

Page 24: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Ring Networks

Page 25: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Intel Xeon: data-center server chip. 20% of Intel’s revenues, 40% of profits. Why? The cloud is growing, and Xeon is dominant.

Page 26: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Compiled Chips

Xeon is a chip family, varying by # of cores and L3 cache size. The chip family’s mask layouts are generated automatically, by adding core/cache slices along the ring bus.

Page 27: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

The bi-directional ring bus connects: cores, cache banks, DRAM controllers, and off-chip I/O. The chip compiler might size the ring bus to scale bandwidth with the # of cores. Ring latency increases with the # of cores, but compared to the baseline it is small.

Ring Stop

Page 28: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

2.5 MB L3 cache slice from a Xeon E5. Tiles along the x-axis are the 20 ways of the cache. The ring stop interface lives in the Cache Control Box (CBOX).

Page 29: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Ring bus (perhaps 1024 wires wide), with address, data, and header fields (sender #, recipient #, command), passing through Ring Stop #1, Ring Stop #2, Ring Stop #3, ...

Ring Stop #2 interface (Data Out, Data In, Control, Empty). Reading: sense Data Out to see if the message is for Ring Stop #2. If so, latch the data and mux Empty onto the ring. Writing: check if Data Out is Empty. If so, mux a message onto the ring via the Data In port.
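Here is a minimal per-cycle sketch of that ring-stop behavior, assuming a single message slot per ring segment. The message format, field names, and transmit-buffer handling are illustrative assumptions, not the Xeon implementation.

```c
/* One ring stop's behavior per ring clock (illustrative sketch). */
#include <stdbool.h>

typedef struct {
    bool empty;       /* slot carries no message            */
    int  sender;      /* header: sender ring-stop number    */
    int  recipient;   /* header: recipient ring-stop number */
    int  command;     /* header: command field              */
    long payload;     /* address/data fields                */
} slot_t;

/* 'in' is the slot arriving on Data Out of the upstream segment;
 * the returned slot is driven onto Data In of the downstream segment. */
slot_t ring_stop(slot_t in, int my_id, slot_t *pending_tx,
                 bool *rx_valid, slot_t *rx_msg) {
    *rx_valid = false;

    /* Reading: if the arriving message is addressed to us, latch it
     * and put an Empty slot back onto the ring. */
    if (!in.empty && in.recipient == my_id) {
        *rx_msg   = in;
        *rx_valid = true;
        in.empty  = true;
    }

    /* Writing: if the slot is now empty and we have a message waiting,
     * mux our message onto the ring via Data In. */
    if (in.empty && pending_tx && !pending_tx->empty) {
        in = *pending_tx;
        pending_tx->empty = true;   /* transmit buffer now free */
    }

    return in;   /* forward the (possibly replaced) slot to the next stop */
}
```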

Page 30: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

In practice: “Extreme EE” to co-optimize bandwidth, reliability.

Page 31: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Debugging: “Network analyzer” built into chip to capture ring messages of a particular kind. Sent off chip via an aux port.

Page 32: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

A derivative of this ring bus is also used on laptop and desktop chips.

Page 33: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Break

Play:

Page 34: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Hit-over-Miss Caches

Page 35: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Recall: CPU-cache port that doesn’t stall on a miss

[Diagram: Queue 1 carries requests from the CPU to the cache; Queue 2 carries responses from the cache back to the CPU.]

CPU makes a request by placing the following items in Queue 1:

CMD: Read, write, etc ...

TAG: 9-bit number identifying the request.

MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit.

MADDR: Memory address of first byte.

STORE-DATA: For stores, the data to store.

Page 36: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

This cache is used in an ASPIRE CPU (Rocket)

When request is ready, cache places the following items in Queue 2:

TAG: Identity of the completed command. LOAD-DATA: For loads, the requested data.

CPU saves info about requests, indexed by TAG.

Why use TAG approach? Multiple misses can proceed in parallel. Loads can return out of order.

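A compact sketch of the two record formats as C structs, matching the fields listed above (field widths follow the slides; the struct and enum names are my own):

```c
/* Queue record formats for the non-blocking CPU-cache port (sketch). */
#include <stdint.h>

typedef enum { CMD_READ, CMD_WRITE /* etc. */ } cmd_t;
typedef enum { MT_8BIT, MT_16BIT, MT_32BIT, MT_64BIT } mtype_t;

/* Queue 1: CPU -> cache request */
typedef struct {
    cmd_t    cmd;         /* CMD: read, write, etc.              */
    uint16_t tag;         /* TAG: 9-bit request identifier       */
    mtype_t  mtype;       /* MTYPE: access size                  */
    uint64_t maddr;       /* MADDR: memory address of first byte */
    uint64_t store_data;  /* STORE-DATA: for stores only         */
} cache_request_t;

/* Queue 2: cache -> CPU response */
typedef struct {
    uint16_t tag;         /* TAG: identity of the completed command */
    uint64_t load_data;   /* LOAD-DATA: for loads, the data         */
} cache_response_t;
```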

Page 37: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Today: How a read request proceeds in the L1 D-Cache

The CPU requests a read by placing MTYPE, TAG, and MADDR in Queue 1. We do a normal cache access. If there is a hit, we place the load result in Queue 2 ... In the case of a miss, we use the Inverted Miss Status Holding Register. (“We” == the L1 D-Cache controller.)

Page 38: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Inverted MSHR (Miss Status Holding Register)

A 512-entry table, so that every 9-bit TAG value has an entry. Each entry holds a Valid bit, the Cache Block # (bits 42:0), the 1st Byte in Block (bits 4:0), and the MTYPE (bits 1:0); a comparator per entry matches the stored block # (Valid qualifies the Hit), and a hard-wired Tag ID identifies the matching entry. Assumptions: 32-byte blocks, 48-bit physical address space.

To look up a memory address: (1) Associatively look up the block # of the memory address in the table. If there are no hits, do a memory request.

Page 39: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Inverted MSHR, continued

(2) Index into the table using the 9-bit TAG, and set all fields using the MADDR and MTYPE queue values. This indexing always finds V = 0, because the CPU promises not to reuse in-flight tags.

Page 40: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Inverted MSHR, continued

(3) Whenever the memory system returns data, associatively look up the block # to find all pending transactions. Place transaction data for all hits in Queue 2, and clear the valid bits. Also update the L1 cache.
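The three steps above fit in a short software model. This is only a sketch of the data structure's behavior (a real inverted MSHR is a CAM searched in parallel, and the names below are my own); field widths follow the slides' assumptions of 32-byte blocks and 48-bit physical addresses.

```c
/* Inverted MSHR sketch: one entry per 9-bit TAG, associative by block #. */
#include <stdint.h>
#include <stdbool.h>

#define NUM_TAGS 512

typedef struct {
    bool     valid;
    uint64_t block;        /* cache block # = MADDR >> 5 (43 bits)  */
    uint8_t  first_byte;   /* MADDR & 0x1F: first byte within block */
    uint8_t  mtype;        /* 2-bit access size                     */
} imshr_entry_t;

static imshr_entry_t imshr[NUM_TAGS];

/* (1) On a miss: is this block already in flight? */
bool imshr_block_in_flight(uint64_t maddr) {
    uint64_t block = maddr >> 5;
    for (int tag = 0; tag < NUM_TAGS; tag++)        /* associative match */
        if (imshr[tag].valid && imshr[tag].block == block)
            return true;
    return false;
}

/* (2) Record the miss, indexed by TAG (CPU guarantees the slot is free). */
void imshr_allocate(int tag, uint64_t maddr, uint8_t mtype) {
    imshr[tag].valid      = true;
    imshr[tag].block      = maddr >> 5;
    imshr[tag].first_byte = maddr & 0x1F;
    imshr[tag].mtype      = mtype;
}

/* (3) When memory returns a block, complete every pending transaction
 * that matches it: push responses toward Queue 2 and clear valid bits.
 * enqueue_response() is a stand-in for the Queue 2 interface. */
void imshr_fill(uint64_t block,
                void (*enqueue_response)(int tag, uint8_t first_byte,
                                         uint8_t mtype)) {
    for (int tag = 0; tag < NUM_TAGS; tag++) {
        if (imshr[tag].valid && imshr[tag].block == block) {
            enqueue_response(tag, imshr[tag].first_byte, imshr[tag].mtype);
            imshr[tag].valid = false;
        }
    }
}
```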

Page 41: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Inverted MSHR notes

Structural hazards only occur when the TAG space is exhausted by the CPU. High cost (# of comparators + SRAM cells). See Farkas and Jouppi, on the class website, for low-cost designs that are often good enough. We will return to MSHRs to discuss CPI performance later in the semester.

Page 42: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Coherency Hardware

Page 43: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Cache Placement

Page 44: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Two CPUs, two caches, shared DRAM ...

Shared main memory initially holds the value 5 at address 16. Both caches are write-through.

CPU0 executes LW R2, 16(R0); CPU0’s cache now holds (addr 16, value 5).
CPU1 executes LW R2, 16(R0); CPU1’s cache now holds (addr 16, value 5).
CPU1 executes SW R0, 16(R0); the write-through store sets address 16 to 0 in CPU1’s cache and in main memory, but CPU0’s cache still holds 5.

View of memory is no longer “coherent”. Loads of location 16 from CPU0 and CPU1 see different values! Today: what to do ...
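As a tiny executable illustration of the failure (a toy model with one cached word per CPU; the names and structure are mine, not from the lecture):

```c
/* Two write-through caches with NO invalidation: CPU0 reads stale data. */
#include <stdio.h>
#include <stdbool.h>

static int  dram_16   = 5;                 /* shared memory, address 16 */
static bool valid[2]  = { false, false };  /* per-CPU cached copy of 16 */
static int  cached[2];

static int lw(int cpu) {                   /* LW R2, 16(R0) */
    if (!valid[cpu]) { cached[cpu] = dram_16; valid[cpu] = true; }
    return cached[cpu];
}

static void sw(int cpu, int value) {       /* SW value, 16(R0), write-through */
    cached[cpu] = value; valid[cpu] = true;
    dram_16 = value;                       /* no invalidation of other caches! */
}

int main(void) {
    lw(0);                                 /* CPU0 loads 5  */
    lw(1);                                 /* CPU1 loads 5  */
    sw(1, 0);                              /* CPU1 stores 0 */
    printf("CPU0 sees %d, CPU1 sees %d\n", lw(0), lw(1));   /* 5 vs 0 */
    return 0;
}
```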

Page 45: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


The simplest solution ... one cache!

[Diagram: CPU0 and CPU1 connect through a memory switch to a shared multi-bank cache, backed by shared main memory.]

CPUs do not have internal caches. Only one cache, so different values for a memory address cannot appear in two caches! Multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank. In that case, one request is stalled.

Page 46: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Not a complete solution ... good for L2.

For modern clock rates, access to a shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing down the clock 10X for LWs. Not good. This approach was a complete solution in the days when DRAM row-access time and the CPU clock period were well matched: Sequent Systems (1980s).

Page 47: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Modified form: Private L1s, shared L2

[Diagram: CPU0 and CPU1 each have private L1 caches; a memory switch or bus connects them to a shared multi-bank L2 cache and shared main memory.]

Advantages of a shared L2 over private L2s: processors communicate at cache speed, not DRAM speed, and there is constructive interference if both CPUs need the same data or instructions. Disadvantage: the CPUs share bandwidth to the L2 cache ... Thus, we need to solve the cache coherency problem for the L1 caches.

Page 48: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

IBM Power 4 (2001): dual core. Shared, multi-bank L2 cache. Private L1 caches. Off-chip L3 caches.

Page 49: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Cache Coherency

Page 50: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Cache coherency goals ...

[Diagram: CPU0 and CPU1, each with a private cache, above a shared memory hierarchy; the address-16 example from before.]

1. Only one processor at a time has write permission for a memory location.
2. No processor can load a stale copy of a location after a write.

Page 51: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Simple Implementation: Snoopy Caches

[Diagram: CPU0 and CPU1, each with a cache snooper, share a memory bus to the shared main memory hierarchy.]

Each cache has the ability to “snoop” on the memory bus transactions of other CPUs. The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate the cache lines of other CPUs.

Page 52: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Writes from 10,000 feet ... for write-thru L1

For write-thru caches:
1. The writing CPU takes control of the bus.
2. The address to be written is invalidated in all other caches, so reads will no longer hit in those caches and get stale data.
3. The write is sent to main memory, so reads that miss will retrieve the new value from main memory.

To first order, reads will “just work” if write-thru caches implement this policy: a “two-state” protocol (cache lines are “valid” or “invalid”).
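A minimal software model of that two-state policy (direct-mapped caches, one word of data per line for simplicity; bus arbitration and timing are omitted, and all names are mine):

```c
/* Two-state (valid/invalid) write-through invalidation sketch. */
#include <stdint.h>
#include <stdbool.h>

#define NUM_CACHES  2
#define NUM_LINES   256
#define LINE_SHIFT  5                 /* 32-byte lines                   */
#define DRAM_BLOCKS (1 << 20)         /* toy main memory size, in blocks */

typedef struct { bool valid; uint64_t block; uint64_t data; } line_t;

static line_t   cache[NUM_CACHES][NUM_LINES];
static uint64_t dram[DRAM_BLOCKS];    /* one word per block, for simplicity */

/* Store from CPU 'writer': invalidate matching lines everywhere else,
 * then write through to the writer's own cache and main memory. */
void store(int writer, uint64_t addr, uint64_t value) {
    uint64_t block = (addr >> LINE_SHIFT) % DRAM_BLOCKS;
    int      idx   = block % NUM_LINES;

    for (int c = 0; c < NUM_CACHES; c++)          /* snooped invalidation */
        if (c != writer && cache[c][idx].valid && cache[c][idx].block == block)
            cache[c][idx].valid = false;

    cache[writer][idx] = (line_t){ .valid = true, .block = block, .data = value };
    dram[block] = value;                          /* write-through */
}

/* Load from CPU 'reader': hit if valid and matching, else refill from DRAM. */
uint64_t load(int reader, uint64_t addr) {
    uint64_t block = (addr >> LINE_SHIFT) % DRAM_BLOCKS;
    int      idx   = block % NUM_LINES;

    if (!(cache[reader][idx].valid && cache[reader][idx].block == block))
        cache[reader][idx] = (line_t){ .valid = true, .block = block,
                                       .data  = dram[block] };
    return cache[reader][idx].data;
}
```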

Page 53: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Limitations of the write-thru approach


Every write goes to the bus.

Total bus write bandwidth does not support more than 2 CPUs, in modern practice.

To scale further, we need to use write-back caches.

Write-back big trick: add extra states. Simplest version: MSI -- Modified, Shared, Invalid. More efficient versions add more states (MESI adds Exclusive). State definitions are subtle ...
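The slide only names the MSI states; below is a textbook-style sketch of the per-line state machine (transitions follow the standard invalidation-based protocol, in the spirit of the figure cited on the next slide, not any particular machine's implementation):

```c
/* Textbook MSI state machine for one write-back, snooping cache line. */
#include <stdbool.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { PR_READ, PR_WRITE,                 /* from this CPU  */
               BUS_READ, BUS_READ_EXCL } msi_event_t;   /* snooped  */

msi_state_t msi_next(msi_state_t s, msi_event_t e,
                     bool *issue_bus_read, bool *issue_bus_read_excl,
                     bool *flush_dirty_line) {
    *issue_bus_read = *issue_bus_read_excl = *flush_dirty_line = false;
    switch (s) {
    case INVALID:
        if (e == PR_READ)  { *issue_bus_read      = true; return SHARED;   }
        if (e == PR_WRITE) { *issue_bus_read_excl = true; return MODIFIED; }
        return INVALID;                       /* snooped traffic: ignore   */
    case SHARED:
        if (e == PR_WRITE)      { *issue_bus_read_excl = true; return MODIFIED; }
        if (e == BUS_READ_EXCL) return INVALID;   /* another CPU will write */
        return SHARED;                            /* PR_READ or snooped read */
    case MODIFIED:
        if (e == BUS_READ)      { *flush_dirty_line = true; return SHARED;  }
        if (e == BUS_READ_EXCL) { *flush_dirty_line = true; return INVALID; }
        return MODIFIED;                          /* local reads/writes hit */
    }
    return s;
}
```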

Page 54: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Figure 5.5, page 358 ... the best starting point.

Page 55: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Read misses ... for a MESI protocol ...

For write-back caches:
1. A cache requests a cache-line fill for a read miss.
2. Another cache holding the line exclusive responds with fresh data, so the read miss does not go to main memory and retrieve stale data.
3. The responding cache changes the line from exclusive to shared, so future writes will go to the bus to be snooped.

These sketches are just to give you a sense of how coherency protocols work. Deep understanding requires understanding the complete “state machine” for the protocol.

Page 56: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Snoopy mechanism doesn’t scale ...

Single-chip implementations have moved to a centralized “directory” service that tracks the status of each line of each private cache. Multi-socket systems use distributed directories.

Page 57: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Directories attached to on-chip cache network ...

Page 58: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

2 socket system ... each socket a multi-core chip

Each chip has its own bank of DRAM.

Page 59: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Distributed directories for multi-socket systems

[Diagram: two chips, each with L1 and L2 caches; one directory tracks Chip 0’s DRAM, and another tracks Chip 1’s DRAM.]

Page 60: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Figure 5.21, page 381 ... directory message basics

Conceptually similar to snoopy caches ... but the different mechanisms require rethinking the protocol to get correct behaviors.

Page 61: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Other Machine Architectures

Page 62: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


NUMA: Non-uniform Memory Access

[Diagram: CPU 0 through CPU 1023, each with a cache and its own DRAM, joined by an interconnection network.]

Each CPU has part of main memory attached to it. To access other parts of main memory, use the interconnection network. For best results, applications take the non-uniform memory latency into account. The network uses a coherent global address space: directory protocols over fiber networking.

Page 63: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Clusters: Supercomputing version of WSC

Connect large numbers of 1-CPU or 2-CPU rack mount computers together with high-end network technology (not normal Ethernet).

Instead of using hardware to create a shared memory abstraction, let an application build its own memory model.

University of Illinois: a 650-node cluster of 2-CPU Apple Xserves, connected with Myrinet (3.5 μs ping time, a low-latency network).

Page 64: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

On Tuesday

We return to CPU design ...

Have a good weekend!