T2 & T3 code (two copies of the consumer thread):

        LW   R3, head(R0)   ; Load queue head into R3
spin:   LW   R4, tail(R0)   ; Load queue tail into R4
        BEQ  R4, R3, spin   ; If queue empty, wait
        LW   R5, 0(R3)      ; Read x from queue into R5
        ADDi R3, R3, 4      ; Shift head by one word
        SW   R3, head(R0)   ; Update head memory addr
[Figure: the shared queue in memory, before and after the producer runs. Before: the queue holds y, bracketed by the Head and Tail pointers. After: x has been added at the next higher address and Tail has advanced one word; Head is unchanged.]
T1 code (producer):

ORi  R1, R0, x      ; Load x value into R1
LW   R2, tail(R0)   ; Load queue tail into R2
SW   R1, 0(R2)      ; Store x into queue
ADDi R2, R2, 4      ; Shift tail by one word
SW   R2, tail(R0)   ; Update tail memory addr
Critical section: T2 and T3 must take turns running the dequeue code (shown in red on the original slide); a sketch of one way to enforce this follows.
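One conventional way to enforce this mutual exclusion is a spinlock around the dequeue sequence. A minimal sketch in C11 (my illustration; the slides' assembly would instead use an atomic instruction such as test-and-set or load-reserved/store-conditional):

#include <stdatomic.h>

/* Sketch: a test-and-set spinlock so the two consumers (T2, T3)
   take turns in the dequeue critical section. */
atomic_flag queue_lock = ATOMIC_FLAG_INIT;

void consumer_dequeue(void)
{
    while (atomic_flag_test_and_set(&queue_lock))
        ;                            /* spin until the lock is acquired */
    /* critical section: read head, compare with tail, load x,
       advance head -- the code marked in red above */
    atomic_flag_clear(&queue_lock);  /* release the lock */
}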
Recall: Sequential Consistency

Sequential Consistency: as if each thread takes turns executing, and instructions in each thread execute in program order.
The four key memory operations are numbered: the producer's store of x (1) and tail update (2), and the consumer's load of tail (3) and load of x (4).

T1 code (producer):

ORi  R1, R0, x      ; Load x value into R1
LW   R2, tail(R0)   ; Load queue tail into R2
SW   R1, 0(R2)      ; (1) Store x into queue
ADDi R2, R2, 4      ; Shift tail by one word
SW   R2, tail(R0)   ; (2) Update tail memory addr

T2 code (consumer):

        LW   R3, head(R0)   ; Load queue head into R3
spin:   LW   R4, tail(R0)   ; (3) Load queue tail into R4
        BEQ  R4, R3, spin   ; If queue empty, wait
        LW   R5, 0(R3)      ; (4) Read x from queue into R5
        ADDi R3, R3, 4      ; Shift head by one word
        SW   R3, head(R0)   ; Update head memory addr
Legal orders: 1, 2, 3, 4 or 1, 3, 2, 4 or 3, 4, 1, 2 ... but not 2, 3, 1, 4! The order 2, 3, 1, 4 swaps the producer's two stores, so sequential consistency forbids it; allowing such reordering would let the consumer see the updated tail and read the queue slot before x was actually stored there.
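As a cross-check, the short C program below (my illustration, not from the slides) enumerates all 24 orderings of the four operations. It prints the six orders that keep both threads in program order, and flags orders such as 2, 3, 4, 1 in which the consumer would see the new tail yet read the queue slot before x was stored; 2, 3, 1, 4 is rejected simply because it swaps the producer's two stores.

#include <stdio.h>

/* Events: 1 = producer stores x, 2 = producer updates tail,
   3 = consumer loads tail, 4 = consumer loads x.
   Sequential consistency permits only interleavings that keep each
   thread's own operations in program order (1 before 2, 3 before 4). */
int main(void)
{
    for (int a = 1; a <= 4; a++)
    for (int b = 1; b <= 4; b++)
    for (int c = 1; c <= 4; c++)
    for (int d = 1; d <= 4; d++) {
        if (((1 << a) | (1 << b) | (1 << c) | (1 << d)) != 0x1E)
            continue;                     /* skip non-permutations */
        int order[4] = {a, b, c, d}, pos[5];
        for (int i = 0; i < 4; i++)
            pos[order[i]] = i;
        if (pos[1] < pos[2] && pos[3] < pos[4])
            printf("legal: %d %d %d %d\n", a, b, c, d);
        else if (pos[2] < pos[3] && pos[4] < pos[1])
            printf("unsafe: %d %d %d %d (new tail seen, stale x read)\n",
                   a, b, c, d);
    }
    return 0;
}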
In modern practice, total bus write bandwidth cannot support more than about 2 CPUs with write-through caches.
To scale further, we need to use write-back caches.
Write-back big trick: keep track of whether other caches also contain a cached line. If not, a cache has an "exclusive" copy of the line, and can read and write the line as if it were the only CPU. For details, take CS 252 ...
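A minimal sketch of the bookkeeping, using simplified MESI-style states (my illustration; the real protocol, with its full set of bus transactions, is the CS 252 material mentioned above):

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } line_state;

/* Local write: with an exclusive (or already-modified) copy we write
   silently; otherwise we must first invalidate other caches' copies. */
line_state on_local_write(line_state s, void (*invalidate_others)(void))
{
    if (s == SHARED || s == INVALID)
        invalidate_others();   /* bus transaction: claim sole ownership */
    return MODIFIED;           /* line is now dirty and ours alone      */
}

/* Snooping another CPU's read of this line: */
line_state on_remote_read(line_state s, void (*write_back)(void))
{
    if (s == MODIFIED)
        write_back();          /* supply the dirty data before sharing  */
    return SHARED;             /* another cache now also holds a copy   */
}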
Clusters also used for web servers

In some applications, each machine can handle a net query by itself.
Example: serving static web pages. Each machine has a copy of the website.
but I intentionally ignore them here because they are well studied elsewhere and because the issues in this article are largely orthogonal to the use of databases.
Advantages

The basic model that giant-scale services follow provides some fundamental advantages:

• Access anywhere, anytime. A ubiquitous infrastructure facilitates access from home, work, airport, and so on.

• Availability via multiple devices. Because the infrastructure handles most of the processing, users can access services with devices such as set-top boxes, network computers, and smart phones, which can offer far more functionality for a given cost and battery life.

• Groupware support. Centralizing data from many users allows service providers to offer group-based applications such as calendars, teleconferencing systems, and group-management systems such as Evite (http://www.evite.com/).

• Lower overall cost. Although hard to measure, infrastructure services have a fundamental cost advantage over designs based on stand-alone devices. Infrastructure resources can be multiplexed across active users, whereas end-user devices serve at most one user (active or not). Moreover, end-user devices have very low utilization (less than 4 percent), while infrastructure resources often reach 80 percent utilization. Thus, moving anything from the device to the infrastructure effectively improves efficiency by a factor of 20. Centralizing the administrative burden and simplifying end devices also reduce overall cost, but are harder to quantify.

• Simplified service updates. Perhaps the most powerful long-term advantage is the ability to upgrade existing services or offer new services without the physical distribution required by traditional applications and devices. Devices such as Web TVs last longer and gain usefulness over time as they benefit automatically from every new Web-based service.
Components

Figure 1 shows the basic model for giant-scale sites. The model is based on several assumptions. First, I assume the service provider has limited control over the clients and the IP network. Greater control might be possible in some cases, however, such as with intranets. The model also assumes that queries drive the service. This is true for most common protocols including HTTP, FTP, and variations of RPC. For example, HTTP's basic primitive, the "get" command, is by definition a query. My third assumption is that read-only queries greatly outnumber updates (queries that affect the persistent data store). Even sites that we tend to think of as highly transactional, such as e-commerce or financial sites, actually have this type of "read-mostly" traffic1: Product evaluations (reads) greatly outnumber purchases (updates), for example, and stock quotes (reads) greatly outnumber stock trades (updates). Finally, as the sidebar, "Clusters in Giant-Scale Services" (next page) explains, all giant-scale sites use clusters.
The basic model includes six components:
• Clients, such as Web browsers, standalone e-mail readers, or even programs that use XML and SOAP initiate the queries to the services.

• The best-effort IP network, whether the public Internet or a private network such as an intranet, provides access to the service.

• The load manager provides a level of indirection between the service's external name and the servers' physical names (IP addresses) to preserve the external name's availability in the presence of server faults. The load manager balances load among active servers. Traffic might flow through proxies or firewalls before the load manager.

• Servers are the system's workers, combining CPU, memory, and disks into an easy-to-replicate unit.
[Figure 1: several clients connect through the IP network to a load manager fronting a single-site server farm, with a persistent data store and an optional backplane behind the servers.]
Figure 1. The basic model for giant-scale services. Clients connect via the Internet and then go through a load manager that hides down nodes and balances traffic. The load manager is a special-purpose computer that assigns incoming HTTP connections to a particular machine. (Image from Eric Brewer's IEEE Internet Computing article.)
Clusters also used for web services

In other applications, many machines work together on each transaction.
Example: Web searching. The search is partitioned over many machines, each of which holds a part of the database.
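A toy sketch of the scatter-gather pattern this implies (my illustration; the document set, the striped partitioning, and the substring matching are all invented for the example):

#include <stdio.h>
#include <string.h>

#define NPARTITIONS 4
#define NDOCS 8

/* The "index": documents striped across NPARTITIONS nodes. */
static const char *docs[NDOCS] = {
    "web search", "cluster design", "web caching", "dns failover",
    "raid storage", "web farms", "load balancing", "search ranking"
};

/* One partition's share of the work: scan only its own documents. */
static void query_partition(int p, const char *term)
{
    for (int d = p; d < NDOCS; d += NPARTITIONS)
        if (strstr(docs[d], term))
            printf("partition %d: doc %d \"%s\"\n", p, d, docs[d]);
}

int main(void)
{
    /* Scatter the query to all partitions, gather the printed hits. */
    for (int p = 0; p < NPARTITIONS; p++)
        query_partition(p, "web");
    return 0;
}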
The AltaVista web search engine did not use clusters; instead, it ran on large shared-memory multiprocessors. That approach could not scale with the growth of the web.
above 20 Gbits per second. They detect down nodes automatically, usually by monitoring open TCP connections, and thus dynamically isolate down nodes from clients quite well.
Two other load-management approaches are typically employed in combination with layer-4 switches. The first uses custom "front-end" nodes that act as service-specific layer-7 routers (in software).2 Wal-Mart's site uses this approach, for example, because it helps with session management: Unlike switches, the nodes track session information for each user.
The final approach includes clients in the load-management process when possible. This general "smart client" end-to-end approach goes beyond the scope of a layer-4 switch.3 It greatly simplifies switching among different physical sites, which in turn simplifies disaster tolerance and overload recovery. Although there is no generic way to do this for the Web, it is common with other systems. In DNS, for instance, clients know about an alternative server and can switch to it if the primary disappears; with cell phones this approach is implemented as part of roaming; and application servers in the middle tier of three-tier database systems understand database failover.
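The DNS case suggests the general shape of a smart client. A toy sketch (my illustration; the replica names and the up/down simulation are invented):

#include <stdio.h>

/* Stand-in for a real request with a timeout. */
static int try_server(const char *name, int up)
{
    if (!up) {
        printf("%s: no response, failing over\n", name);
        return -1;
    }
    printf("%s: reply received\n", name);
    return 0;
}

int main(void)
{
    const char *replicas[] = { "primary.example.com", "backup.example.com" };
    int up[] = { 0, 1 };             /* simulate: the primary is down */

    /* The client itself knows the alternative server and switches
       to it when the primary disappears. */
    for (int i = 0; i < 2; i++)
        if (try_server(replicas[i], up[i]) == 0)
            return 0;
    fprintf(stderr, "all replicas down\n");
    return 1;
}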
Figures 2 and 3 illustrate systems at opposite ends of the complexity spectrum: a simple Web farm and a server similar to the Inktomi search engine cluster. These systems differ in load management, use of a backplane, and persistent data store.
The Web farm in Figure 2 uses round-robin DNS for load management. The persistent data store is implemented by simply replicating all content to all nodes, which works well with a small amount of content. Finally, because all servers can handle all queries, there is no coherence traffic and no need for a backplane. In practice, even simple Web farms often have a second LAN (backplane) to simplify manual updates of the replicas. In this version, node failures reduce system capacity, but not data availability.
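A minimal sketch of the round-robin idea (my illustration; the addresses are invented): successive lookups hand out successive server addresses, so identical replicas share the load evenly.

#include <stdio.h>

/* Round-robin selection, as a DNS server might rotate A records. */
static const char *replicas[] = { "10.0.0.1", "10.0.0.2", "10.0.0.3" };
static int next_replica;

static const char *pick_server(void)
{
    const char *s = replicas[next_replica];
    next_replica = (next_replica + 1) % 3;   /* rotate for next client */
    return s;
}

int main(void)
{
    for (int client = 0; client < 6; client++)
        printf("client %d -> %s\n", client, pick_server());
    return 0;
}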
In Figure 3, a pair of layer-4 switches manages the load within the site. The "clients" are actually other programs (typically Web servers) that use the smart-client approach to failover among different physical clusters, primarily based on load.
Because the persistent store is partitioned across servers, possibly without replication, node failures could reduce the store's effective size and overall capacity. Furthermore, the nodes are no longer identical, and some queries might need to be directed to specific nodes. This is typically accomplished using a layer-7 switch to parse URLs, but some systems, such as clustered Web caches, might also use the backplane to route requests to the correct node.4
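One common way such a switch can map a URL to the right node is to hash the URL (or a key parsed out of it) down to a partition number. A sketch using the well-known djb2 string hash (the hash choice and the example URL are mine, for illustration only):

#include <stdio.h>

/* Map a URL to the node whose partition holds its data. */
static unsigned node_for_url(const char *url, unsigned nnodes)
{
    unsigned h = 5381;                  /* djb2 string hash        */
    for (; *url; url++)
        h = h * 33 + (unsigned char)*url;
    return h % nnodes;                  /* partition = target node */
}

int main(void)
{
    printf("%u\n", node_for_url("/cache/object/42", 8));
    return 0;
}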
High Availability

High availability is a major driving requirement behind giant-scale system design. Other infra-
[Figure 2: several clients connect through the IP network to a single-site server farm via round-robin DNS, with a simple replicated store behind the servers.]
Figure 2. A simple Web farm. Round-robin DNS assigns different servers to different clients to achieve simple load balancing. Persistent data is fully replicated and thus all nodes are identical and can handle all queries.
[Figure 3: several client programs connect through the IP network to a load manager fronting a single-site server farm, with a partitioned data store and a Myrinet backplane behind the servers.]
Figure 3. Search engine cluster. The service provides support to other programs (Web servers) rather than directly to end users. These programs connect via layer-4 switches that balance load and hide faults. Persistent data is partitioned across the servers, which increases aggregate capacity but implies there is some data loss when a server is down. A backplane allows all nodes to access all data.