Page 1: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

CS 152 Computer Architecture and Engineering
Lecture 14 - Cache Design and Coherence
2014-3-6, John Lazzaro (not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
UC Regents Spring 2014 © UCB

Play:

Page 2: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Today: Shared Cache Design and Coherence

Crossbars and Rings: How to do on-chip sharing.

Concurrent requests: Interfaces that don’t stall.

CPU multi-threading: Keeps memory system busy.

Coherency Protocols: Building coherent caches.

[Diagram: multiple CPUs with private caches connect through shared ports to shared caches, DRAM, and I/O.]

Page 3: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Multithreading

Sun Microsystems Niagara series

Page 4: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


The case for multithreading

Some applications spend their lives waiting for memory. (C = compute, M = waiting.) Amdahl’s Law tells us that optimizing C is the wrong thing to do ...

Idea: Create a design that can multiplex threads onto one pipeline. Goal: Maximize throughput of a large number of threads.

Page 5: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Multi-threading: Assuming perfect caches

Acts like 4 CPUs, each running at 1/4 the clock rate (S. Cray, 1962). Labels T1-T4 show which thread occupies each pipeline stage.

Page 6: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


[Pipeline diagram: ID (Decode), EX, MEM, and WB stages, with IR registers between stages and WE/MemToReg control coming from WB.]

Bypass network is no longer needed ... Result: the critical path shortens, which can be traded for speed or power.

Page 7: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Multi-threading: Supporting cache misses

A thread scheduler keeps track of information about all threads that share the pipeline. When a thread experiences a cache miss, it is taken off the pipeline during the miss-penalty period, as sketched below.
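Below is a minimal software sketch of that idea: a round-robin scheduler that skips threads waiting on misses. The thread count, field names, and the miss-penalty bookkeeping are illustrative assumptions, not the actual hardware design.

```c
/* Sketch of a miss-aware thread scheduler (illustrative only). */
#include <stdbool.h>

#define NUM_THREADS 8

typedef struct {
    bool waiting_on_miss;   /* true while the miss penalty is outstanding */
    int  miss_cycles_left;  /* remaining miss-penalty cycles              */
} thread_state_t;

static thread_state_t threads[NUM_THREADS];
static int last_issued = 0;

/* Called once per cycle: pick the next ready thread, or -1 if none. */
int select_thread(void) {
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (last_issued + i) % NUM_THREADS;
        if (!threads[t].waiting_on_miss) {
            last_issued = t;
            return t;           /* this thread occupies the pipeline slot */
        }
    }
    return -1;                  /* all threads stalled on misses: bubble  */
}

/* Called when thread t takes a cache miss: remove it from scheduling. */
void report_miss(int t, int miss_penalty) {
    threads[t].waiting_on_miss  = true;
    threads[t].miss_cycles_left = miss_penalty;
}

/* Called once per cycle to age outstanding misses. */
void tick(void) {
    for (int t = 0; t < NUM_THREADS; t++)
        if (threads[t].waiting_on_miss && --threads[t].miss_cycles_left == 0)
            threads[t].waiting_on_miss = false;
}
```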

Page 8: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II: # threads/core?

8 threads/core: enough to keep one core busy, given clock speed, memory system latency, and target application characteristics.
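A rough back-of-the-envelope, with assumed numbers (they are not from the slide): if a thread computes for about $t_c$ cycles between misses and then waits about $t_m$ cycles for memory, keeping one core busy needs roughly

$$N_{\text{threads}} \;\approx\; 1 + \frac{t_m}{t_c} \;\approx\; 1 + \frac{200\ \text{cycles}}{30\ \text{cycles}} \;\approx\; 8.$$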

Page 9: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Crossbar Networks

Page 10: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Shared-memory

[Diagram: CPUs with private caches connect through shared ports to shared caches, DRAM, and I/O.]

CPUs share the lower level of the memory system, and I/O. Common address space, one operating system image. Communication occurs through the memory system (~100 ns latency, 20 GB/s bandwidth).

Page 11: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Sun’s Niagara II: Single-chip implementation ...

SPC == SPARC Core. Only DRAM is not on chip.

Page 12: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Crossbar: Like N ports on an N-register file

[Register-file schematic: registers R0 (the constant 0) through R31, clocked by clk, feed two 32-bit read-port muxes (rd1, rd2) selected by 5-bit sel(rs1) and sel(rs2); a demux selected by 5-bit sel(ws), qualified by WE, routes the write enable for the 32-bit write data wd.]

Flexible, but ... reads slow down as O(N²). Why? The number of loads on each Q output grows as O(N), and the wire length to the port mux grows as O(N).
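As a rough scaling argument (a simplification of the real RC delay, not a claim from the slide), the two linear factors multiply:

$$t_{\text{read}} \;\propto\; \underbrace{O(N)}_{\text{loads on each } Q} \times \underbrace{O(N)}_{\text{wire length to port mux}} \;=\; O(N^2).$$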

Page 13: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Design challenge: High-performance crossbar

Niagara II: 8 cores, 8 L2 banks, 4 DRAM channels. Apps are locality-poor. Goal: saturate DRAM bandwidth.

Each DRAM channel: 50 GB/s read, 25 GB/s write bandwidth. Crossbar bandwidth: 270 GB/s total (read + write).

Page 14: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II 8 x 9 Crossbar

Every cross of blue and purple is a tri-state buffer with a unique control signal: 72 control signals (if distributed unencoded). A tri-state distributed mux, as in the microcode talk.

Page 15: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II 8 x 9 Crossbar

8 ports on the CPU side (one per core); 8 ports for L2 banks, plus one for I/O. 4-cycle latency (715 ps/cycle): cycles 1-3 are for arbitration, and data transmits on cycle 4. 100-200 wires per port (each way). Pipelined.

Page 16: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


A complete switch transfer (4 epochs)

Epoch 1: All input ports (that are ready to send data) request an output port.

Epoch 2: Allocation algorithm decides which inputs get to write.

Epoch 3: Allocation system informs the winning inputs and outputs.

Epoch 4: Actual data transfer takes place.

Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, each for a different set of requests.

Page 17: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Epoch 3: The Allocation Problem (4 x 4)

Request matrix (rows: input ports A-D; columns: output ports W, X, Y, Z). A 1 codes that an input has data ready to send to an output.

     W  X  Y  Z
  A  0  0  1  0
  B  1  0  0  0
  C  0  0  1  0
  D  1  0  0  0

The allocator returns a matrix with at most one 1 in each row and column, to set the switches:

     W  X  Y  Z
  A  0  0  1  0
  B  0  0  0  0
  C  0  0  0  0
  D  1  0  0  0

The algorithm should be “fair”, so no port always loses ... and should also “scale” to run large matrices fast. A minimal allocator sketch follows.
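Below is a minimal sketch of one way to compute such a grant matrix: a greedy pass with a rotating starting point for rough fairness. It illustrates the problem statement only; the real Niagara II arbiter is not described in this lecture, and all names here are my own.

```c
/* Toy allocator: grant at most one output per input and one input per
 * output; rotate which port gets first pick each round for fairness. */
#include <stdio.h>
#include <stdbool.h>

#define N 4

void allocate(const bool req[N][N], bool grant[N][N], int round) {
    bool out_taken[N] = { false };
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            grant[i][j] = false;

    for (int k = 0; k < N; k++) {
        int in = (k + round) % N;            /* rotating input priority  */
        for (int m = 0; m < N; m++) {
            int out = (m + round) % N;       /* rotating output priority */
            if (req[in][out] && !out_taken[out]) {
                grant[in][out] = true;       /* at most one 1 per row    */
                out_taken[out] = true;       /* at most one 1 per column */
                break;
            }
        }
    }
}

int main(void) {
    /* Request matrix from the slide: rows A-D, columns W-Z. */
    bool req[N][N] = {
        {0,0,1,0},   /* A -> Y */
        {1,0,0,0},   /* B -> W */
        {0,0,1,0},   /* C -> Y */
        {1,0,0,0},   /* D -> W */
    };
    bool grant[N][N];
    allocate(req, grant, 0);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%d ", grant[i][j]);
        printf("\n");
    }
    return 0;
}
```

With the request matrix above and round 0, this sketch grants A→Y and B→W; the slide's example grant (A→Y, D→W) is an equally valid answer, which is exactly why fairness across rounds matters.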

Page 18: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II Crossbar Notes

Crossbar defines the floorplan: all port devices should be equidistant to the crossbar, giving uniform latency between all port pairs. Low latency: 4 cycles (less than 3 ns).

Page 19: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Page 20: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II Energy Facts: the crossbar is only 1% of total power.

Page 21: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Sun Niagara II Crossbar Notes

Crossbar defines the floorplan: all port devices should be equidistant to the crossbar, giving uniform latency between all port pairs. Low latency: 4 cycles (less than 3 ns).

It did not scale up for the 16-core Rainbow Falls. Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port between two cores.

Design alternatives to the crossbar?

Page 22: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


CLOS Networks: From telecom world ...

Build a high-port-count switch by tiling fixed-size shuffle units. Pipeline registers naturally fit between tiles. Trades scalability for latency.

Page 23: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


CLOS Networks: An example route

Numbers on left and right are port numbers. Colors show routing paths for an exchange. Arbitration still needed to prevent blocking.

Page 24: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Ring Networks

Page 25: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Intel Xeon: data-center server chip. 20% of Intel’s revenues, 40% of profits. Why? The cloud is growing, and Xeon is dominant.

Page 26: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Compiled Chips

Xeon is a chip family, varying by # of cores and L3 cache size. The chip family’s mask layouts are generated automatically, by adding core/cache slices along the ring bus.

Page 27: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

The bi-directional ring bus connects: cores, cache banks, DRAM controllers, and off-chip I/O. The chip compiler might size the ring bus to scale bandwidth with the # of cores. Ring latency increases with the # of cores, but compared to the baseline it is small.

Ring Stop

Page 28: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

2.5 MB L3 cache slice from a Xeon E5. Tiles along the x-axis are the 20 ways of the cache. The ring stop interface lives in the Cache Control Box (CBOX).

Page 29: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Ring bus (perhaps 1024 wires wide), with address, data, and header fields (sender #, recipient #, command), passing through Ring Stop #1, Ring Stop #2, Ring Stop #3, ...

Ring Stop #2 interface (Data Out, Data In, Control, Empty). Reading: sense Data Out to see if the message is for Ring Stop #2. If so, latch the data and mux Empty onto the ring. Writing: check if Data Out is Empty. If so, mux a message onto the ring via the Data In port.
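Here is a minimal per-cycle sketch of that ring-stop behavior, assuming a single message slot per ring segment. The message format, field names, and transmit-buffer handling are illustrative assumptions, not the Xeon implementation.

```c
/* One ring stop's behavior per ring clock (illustrative sketch). */
#include <stdbool.h>

typedef struct {
    bool empty;       /* slot carries no message            */
    int  sender;      /* header: sender ring-stop number    */
    int  recipient;   /* header: recipient ring-stop number */
    int  command;     /* header: command field              */
    long payload;     /* address/data fields                */
} slot_t;

/* 'in' is the slot arriving on Data Out of the upstream segment;
 * the returned slot is driven onto Data In of the downstream segment. */
slot_t ring_stop(slot_t in, int my_id, slot_t *pending_tx,
                 bool *rx_valid, slot_t *rx_msg) {
    *rx_valid = false;

    /* Reading: if the arriving message is addressed to us, latch it
     * and put an Empty slot back onto the ring. */
    if (!in.empty && in.recipient == my_id) {
        *rx_msg   = in;
        *rx_valid = true;
        in.empty  = true;
    }

    /* Writing: if the slot is now empty and we have a message waiting,
     * mux our message onto the ring via Data In. */
    if (in.empty && pending_tx && !pending_tx->empty) {
        in = *pending_tx;
        pending_tx->empty = true;   /* transmit buffer now free */
    }

    return in;   /* forward the (possibly replaced) slot to the next stop */
}
```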

Page 30: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

In practice: “Extreme EE” to co-optimize bandwidth, reliability.

Page 31: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Debugging: “Network analyzer” built into chip to capture ring messages of a particular kind. Sent off chip via an aux port.

Page 32: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

A derivative of this ring bus is also used on laptop and desktop chips.

Page 33: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Break

Play:

Page 34: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Hit-over-Miss Caches

Page 35: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Recall: CPU-cache port that doesn’t stall on a miss

[Diagram: Queue 1 carries requests from the CPU to the cache; Queue 2 carries responses from the cache back to the CPU.]

CPU makes a request by placing the following items in Queue 1:

CMD: Read, write, etc ...

TAG: 9-bit number identifying the request.

MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit.

MADDR: Memory address of first byte.

STORE-DATA: For stores, the data to store.

Page 36: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

This cache is used in an ASPIRE CPU (Rocket)

When request is ready, cache places the following items in Queue 2:

TAG: Identity of the completed command. LOAD-DATA: For loads, the requested data.

CPU saves info about requests, indexed by TAG.

Why use TAG approach? Multiple misses can proceed in parallel. Loads can return out of order.

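A compact sketch of the two record formats as C structs, matching the fields listed above (field widths follow the slides; the struct and enum names are my own):

```c
/* Queue record formats for the non-blocking CPU-cache port (sketch). */
#include <stdint.h>

typedef enum { CMD_READ, CMD_WRITE /* etc. */ } cmd_t;
typedef enum { MT_8BIT, MT_16BIT, MT_32BIT, MT_64BIT } mtype_t;

/* Queue 1: CPU -> cache request */
typedef struct {
    cmd_t    cmd;         /* CMD: read, write, etc.              */
    uint16_t tag;         /* TAG: 9-bit request identifier       */
    mtype_t  mtype;       /* MTYPE: access size                  */
    uint64_t maddr;       /* MADDR: memory address of first byte */
    uint64_t store_data;  /* STORE-DATA: for stores only         */
} cache_request_t;

/* Queue 2: cache -> CPU response */
typedef struct {
    uint16_t tag;         /* TAG: identity of the completed command */
    uint64_t load_data;   /* LOAD-DATA: for loads, the data         */
} cache_response_t;
```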

Page 37: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Today: How a read request proceeds in the L1 D-Cache

The CPU requests a read by placing MTYPE, TAG, and MADDR in Queue 1. We do a normal cache access. If there is a hit, we place the load result in Queue 2 ... In the case of a miss, we use the Inverted Miss Status Holding Register. (“We” == the L1 D-Cache controller.)

Page 38: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Inverted MSHR (Miss Status Holding Register)

A 512-entry table, so that every 9-bit TAG value has an entry. Each entry holds a Valid bit, the Cache Block # (bits 42:0), the 1st Byte in Block (bits 4:0), and the MTYPE (bits 1:0); a comparator per entry matches the stored block # (Valid qualifies the Hit), and a hard-wired Tag ID identifies the matching entry. Assumptions: 32-byte blocks, 48-bit physical address space.

To look up a memory address: (1) Associatively look up the block # of the memory address in the table. If there are no hits, do a memory request.

Page 39: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Inverted MSHR, continued

(2) Index into the table using the 9-bit TAG, and set all fields using the MADDR and MTYPE queue values. This indexing always finds V = 0, because the CPU promises not to reuse in-flight tags.

Page 40: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Inverted MSHR, continued

(3) Whenever the memory system returns data, associatively look up the block # to find all pending transactions. Place transaction data for all hits in Queue 2, and clear the valid bits. Also update the L1 cache.
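The three steps above fit in a short software model. This is only a sketch of the data structure's behavior (a real inverted MSHR is a CAM searched in parallel, and the names below are my own); field widths follow the slides' assumptions of 32-byte blocks and 48-bit physical addresses.

```c
/* Inverted MSHR sketch: one entry per 9-bit TAG, associative by block #. */
#include <stdint.h>
#include <stdbool.h>

#define NUM_TAGS 512

typedef struct {
    bool     valid;
    uint64_t block;        /* cache block # = MADDR >> 5 (43 bits)  */
    uint8_t  first_byte;   /* MADDR & 0x1F: first byte within block */
    uint8_t  mtype;        /* 2-bit access size                     */
} imshr_entry_t;

static imshr_entry_t imshr[NUM_TAGS];

/* (1) On a miss: is this block already in flight? */
bool imshr_block_in_flight(uint64_t maddr) {
    uint64_t block = maddr >> 5;
    for (int tag = 0; tag < NUM_TAGS; tag++)        /* associative match */
        if (imshr[tag].valid && imshr[tag].block == block)
            return true;
    return false;
}

/* (2) Record the miss, indexed by TAG (CPU guarantees the slot is free). */
void imshr_allocate(int tag, uint64_t maddr, uint8_t mtype) {
    imshr[tag].valid      = true;
    imshr[tag].block      = maddr >> 5;
    imshr[tag].first_byte = maddr & 0x1F;
    imshr[tag].mtype      = mtype;
}

/* (3) When memory returns a block, complete every pending transaction
 * that matches it: push responses toward Queue 2 and clear valid bits.
 * enqueue_response() is a stand-in for the Queue 2 interface. */
void imshr_fill(uint64_t block,
                void (*enqueue_response)(int tag, uint8_t first_byte,
                                         uint8_t mtype)) {
    for (int tag = 0; tag < NUM_TAGS; tag++) {
        if (imshr[tag].valid && imshr[tag].block == block) {
            enqueue_response(tag, imshr[tag].first_byte, imshr[tag].mtype);
            imshr[tag].valid = false;
        }
    }
}
```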

Page 41: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Inverted MSHR notes

Structural hazards only occur when the TAG space is exhausted by the CPU. High cost (# of comparators + SRAM cells). See Farkas and Jouppi, on the class website, for low-cost designs that are often good enough. We will return to MSHRs to discuss CPI performance later in the semester.

Page 42: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Coherency Hardware

Page 43: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Cache Placement

Page 44: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Two CPUs, two caches, shared DRAM ...

Shared main memory initially holds the value 5 at address 16. Both caches are write-through.

CPU0 executes LW R2, 16(R0); CPU0’s cache now holds (addr 16, value 5).
CPU1 executes LW R2, 16(R0); CPU1’s cache now holds (addr 16, value 5).
CPU1 executes SW R0, 16(R0); the write-through store sets address 16 to 0 in CPU1’s cache and in main memory, but CPU0’s cache still holds 5.

View of memory is no longer “coherent”. Loads of location 16 from CPU0 and CPU1 see different values! Today: what to do ...
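As a tiny executable illustration of the failure (a toy model with one cached word per CPU; the names and structure are mine, not from the lecture):

```c
/* Two write-through caches with NO invalidation: CPU0 reads stale data. */
#include <stdio.h>
#include <stdbool.h>

static int  dram_16   = 5;                 /* shared memory, address 16 */
static bool valid[2]  = { false, false };  /* per-CPU cached copy of 16 */
static int  cached[2];

static int lw(int cpu) {                   /* LW R2, 16(R0) */
    if (!valid[cpu]) { cached[cpu] = dram_16; valid[cpu] = true; }
    return cached[cpu];
}

static void sw(int cpu, int value) {       /* SW value, 16(R0), write-through */
    cached[cpu] = value; valid[cpu] = true;
    dram_16 = value;                       /* no invalidation of other caches! */
}

int main(void) {
    lw(0);                                 /* CPU0 loads 5  */
    lw(1);                                 /* CPU1 loads 5  */
    sw(1, 0);                              /* CPU1 stores 0 */
    printf("CPU0 sees %d, CPU1 sees %d\n", lw(0), lw(1));   /* 5 vs 0 */
    return 0;
}
```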

Page 45: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


The simplest solution ... one cache!

[Diagram: CPU0 and CPU1 connect through a memory switch to a shared multi-bank cache, backed by shared main memory.]

CPUs do not have internal caches. Only one cache, so different values for a memory address cannot appear in two caches! Multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank. In that case, one request is stalled.

Page 46: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Not a complete solution ... good for L2.

For modern clock rates, access to a shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing down the clock 10X for LWs. Not good. This approach was a complete solution in the days when DRAM row-access time and the CPU clock period were well matched: Sequent Systems (1980s).

Page 47: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Modified form: Private L1s, shared L2

[Diagram: CPU0 and CPU1 each have private L1 caches; a memory switch or bus connects them to a shared multi-bank L2 cache and shared main memory.]

Advantages of a shared L2 over private L2s: processors communicate at cache speed, not DRAM speed, and there is constructive interference if both CPUs need the same data or instructions. Disadvantage: the CPUs share bandwidth to the L2 cache ... Thus, we need to solve the cache coherency problem for the L1 caches.

Page 48: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

IBM Power 4 (2001): dual core. Shared, multi-bank L2 cache. Private L1 caches. Off-chip L3 caches.

Page 49: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Cache Coherency

Page 50: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Cache coherency goals ...

[Diagram: CPU0 and CPU1, each with a private cache, above a shared memory hierarchy; the address-16 example from before.]

1. Only one processor at a time has write permission for a memory location.
2. No processor can load a stale copy of a location after a write.

Page 51: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Simple Implementation: Snoopy Caches

[Diagram: CPU0 and CPU1, each with a cache snooper, share a memory bus to the shared main memory hierarchy.]

Each cache has the ability to “snoop” on the memory bus transactions of other CPUs. The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate the cache lines of other CPUs.

Page 52: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Writes from 10,000 feet ... for write-thru L1

For write-thru caches:
1. The writing CPU takes control of the bus.
2. The address to be written is invalidated in all other caches, so reads will no longer hit in those caches and get stale data.
3. The write is sent to main memory, so reads that miss will retrieve the new value from main memory.

To first order, reads will “just work” if write-thru caches implement this policy: a “two-state” protocol (cache lines are “valid” or “invalid”).
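A minimal software model of that two-state policy (direct-mapped caches, one word of data per line for simplicity; bus arbitration and timing are omitted, and all names are mine):

```c
/* Two-state (valid/invalid) write-through invalidation sketch. */
#include <stdint.h>
#include <stdbool.h>

#define NUM_CACHES  2
#define NUM_LINES   256
#define LINE_SHIFT  5                 /* 32-byte lines                   */
#define DRAM_BLOCKS (1 << 20)         /* toy main memory size, in blocks */

typedef struct { bool valid; uint64_t block; uint64_t data; } line_t;

static line_t   cache[NUM_CACHES][NUM_LINES];
static uint64_t dram[DRAM_BLOCKS];    /* one word per block, for simplicity */

/* Store from CPU 'writer': invalidate matching lines everywhere else,
 * then write through to the writer's own cache and main memory. */
void store(int writer, uint64_t addr, uint64_t value) {
    uint64_t block = (addr >> LINE_SHIFT) % DRAM_BLOCKS;
    int      idx   = block % NUM_LINES;

    for (int c = 0; c < NUM_CACHES; c++)          /* snooped invalidation */
        if (c != writer && cache[c][idx].valid && cache[c][idx].block == block)
            cache[c][idx].valid = false;

    cache[writer][idx] = (line_t){ .valid = true, .block = block, .data = value };
    dram[block] = value;                          /* write-through */
}

/* Load from CPU 'reader': hit if valid and matching, else refill from DRAM. */
uint64_t load(int reader, uint64_t addr) {
    uint64_t block = (addr >> LINE_SHIFT) % DRAM_BLOCKS;
    int      idx   = block % NUM_LINES;

    if (!(cache[reader][idx].valid && cache[reader][idx].block == block))
        cache[reader][idx] = (line_t){ .valid = true, .block = block,
                                       .data  = dram[block] };
    return cache[reader][idx].data;
}
```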

Page 53: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Limitations of the write-thru approach


Every write goes to the bus.

Total bus write bandwidth does not support more than 2 CPUs, in modern practice.

To scale further, we need to use write-back caches.

Write-back big trick: add extra states. Simplest version: MSI -- Modified, Shared, Invalid. More efficient versions add more states (MESI adds Exclusive). State definitions are subtle ...
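The slide only names the MSI states; below is a textbook-style sketch of the per-line state machine (transitions follow the standard invalidation-based protocol, in the spirit of the figure cited on the next slide, not any particular machine's implementation):

```c
/* Textbook MSI state machine for one write-back, snooping cache line. */
#include <stdbool.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { PR_READ, PR_WRITE,                 /* from this CPU  */
               BUS_READ, BUS_READ_EXCL } msi_event_t;   /* snooped  */

msi_state_t msi_next(msi_state_t s, msi_event_t e,
                     bool *issue_bus_read, bool *issue_bus_read_excl,
                     bool *flush_dirty_line) {
    *issue_bus_read = *issue_bus_read_excl = *flush_dirty_line = false;
    switch (s) {
    case INVALID:
        if (e == PR_READ)  { *issue_bus_read      = true; return SHARED;   }
        if (e == PR_WRITE) { *issue_bus_read_excl = true; return MODIFIED; }
        return INVALID;                       /* snooped traffic: ignore   */
    case SHARED:
        if (e == PR_WRITE)      { *issue_bus_read_excl = true; return MODIFIED; }
        if (e == BUS_READ_EXCL) return INVALID;   /* another CPU will write */
        return SHARED;                            /* PR_READ or snooped read */
    case MODIFIED:
        if (e == BUS_READ)      { *flush_dirty_line = true; return SHARED;  }
        if (e == BUS_READ_EXCL) { *flush_dirty_line = true; return INVALID; }
        return MODIFIED;                          /* local reads/writes hit */
    }
    return s;
}
```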

Page 54: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Figure 5.5, page 358 ... the best starting point.

Page 55: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Read misses ... for a MESI protocol ...

For write-back caches:
1. A cache requests a cache-line fill for a read miss.
2. Another cache holding the line exclusive responds with fresh data, so the read miss does not go to main memory and retrieve stale data.
3. The responding cache changes the line from exclusive to shared, so future writes will go to the bus to be snooped.

These sketches are just to give you a sense of how coherency protocols work. Deep understanding requires understanding the complete “state machine” for the protocol.

Page 56: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Snoopy mechanism doesn’t scale ...

Single-chip implementations have moved to a centralized “directory” service that tracks the status of each line of each private cache. Multi-socket systems use distributed directories.

Page 57: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Directories attached to on-chip cache network ...

Page 58: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

2 socket system ... each socket a multi-core chip

Each chip has its own bank of DRAM.

Page 59: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Distributed directories for multi-socket systems

[Diagram: two chips, each with L1 and L2 caches; one directory tracks Chip 0’s DRAM, and another tracks Chip 1’s DRAM.]

Page 60: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

Figure 5.21, page 381 ... directory message basics

Conceptually similar to snoopy caches ... but the different mechanisms require rethinking the protocol to get correct behaviors.

Page 61: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Other Machine Architectures

Page 62: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


NUMA: Non-uniform Memory Access

[Diagram: CPU 0 through CPU 1023, each with a cache and its own DRAM, joined by an interconnection network.]

Each CPU has part of main memory attached to it. To access other parts of main memory, use the interconnection network. For best results, applications take the non-uniform memory latency into account. The network uses a coherent global address space: directory protocols over fiber networking.

Page 63: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)


Clusters: Supercomputing version of WSC

Connect large numbers of 1-CPU or 2-CPU rack mount computers together with high-end network technology (not normal Ethernet).

Instead of using hardware to create a shared memory abstraction, let an application build its own memory model.

University of Illinois: a 650-node cluster of 2-CPU Apple Xserves, connected with Myrinet (3.5 μs ping time, a low-latency network).

Page 64: 2014-3-6 John Lazzaro (not a prof - “John” is always OK)

On Tuesday

We return to CPU design ...

Have a good weekend!