
UC Regents Spring 2005 © UCB CS 152 L26: Synchronization

2005-4-26 John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 26 – Synchronization

www-inst.eecs.berkeley.edu/~cs152/

TAs: Ted Hong and David Marquardt


Last Time: How Routers Work

238 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 6, NO. 3, JUNE 1998

Fig. 1. MGR outline.

A. Design Summary

A simplified outline of the MGR design is shown in Fig. 1, which illustrates the data processing path for a stream of packets entering from the line card on the left and exiting from the line card on the right.

The MGR consists of multiple line cards (each supporting one or more network interfaces) and forwarding engine cards, all plugged into a high-speed switch. When a packet arrives at a line card, its header is removed and passed through the switch to a forwarding engine. (The remainder of the packet remains on the inbound line card). The forwarding engine reads the header to determine how to forward the packet and then updates the header and sends the updated header and its forwarding instructions back to the inbound line card. The inbound line card integrates the new header with the rest of the packet and sends the entire packet to the outbound line card for transmission.

Not shown in Fig. 1 but an important piece of the MGR is a control processor, called the network processor, that provides basic management functions such as link up/down management and generation of forwarding engine routing tables for the router.

B. Major Innovations

There are five novel elements of this design. This section briefly presents the innovations. More detailed discussions, when needed, can be found in the sections following.

First, each forwarding engine has a complete set of the routing tables. Historically, routers have kept a central master routing table and the satellite processors each keep only a modest cache of recently used routes. If a route was not in a satellite processor’s cache, it would request the relevant route from the central table. At high speeds, the central table can easily become a bottleneck because the cost of retrieving a route from the central table is many times (as much as 1000 times) more expensive than actually processing the packet header. So the solution is to push the routing tables down into each forwarding engine. Since the forwarding engines only require a summary of the data in the route (in particular, next hop information), their copies of the routing table, called forwarding tables, can be very small (as little as 100 kB for about 50k routes [6]).

Second, the design uses a switched backplane. Until very recently, the standard router used a shared bus rather than a switched backplane. However, to go fast, one really needs the parallelism of a switch. Our particular switch was custom designed to meet the needs of an Internet protocol (IP) router.

Third, the design places forwarding engines on boards distinct from line cards. Historically, forwarding processors have been placed on the line cards. We chose to separate them for several reasons. One reason was expediency; we were not sure if we had enough board real estate to fit both forwarding engine functionality and line card functions on the target card size. Another set of reasons involves flexibility. There are well-known industry cases of router designers crippling their routers by putting too weak a processor on the line card, and effectively throttling the line card’s interfaces to the processor’s speed. Rather than risk this mistake, we built the fastest forwarding engine we could and allowed as many (or few) interfaces as is appropriate to share the use of the forwarding engine. This decision had the additional benefit of making support for virtual private networks very simple -- we can dedicate a forwarding engine to each virtual network and ensure that packets never cross (and risk confusion) in the forwarding path.

Placing forwarding engines on separate cards led to a fourth innovation. Because the forwarding engines are separate from the line cards, they may receive packets from line cards that ...

2. Forwarding engine determines the next hop for the packet, and returns next-hop data to the line card, together with an updated header.



Recall: Two CPUs sharing memory

... supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5’s 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview

Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance [5]. Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130 nm lithography, the chip uses eight metal levels and measures 389 mm².

The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data’s real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller. We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core

We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5’s instruction pipeline, which is identical to the Power4’s. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5’s instruction flow diagram.

In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

42 HOT CHIPS 15 IEEE MICRO

Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).

In earlier lectures, we pretended it was easy to let several CPUs share a memory system.

In fact, it is an architectural challenge. Even letting several threads on one machine share memory is tricky.


Today: Hardware Thread Support

Producer/Consumer: One thread writes A, one thread reads A.

Locks: Two threads share write access to A.

On Thursday: Multiprocessor memory system design and synchronization issues.

Thursday is a simplified overview -- graduate-level architecture courses spend weeks on this topic ...


How 2 threads share a queue ...

[Figure: an empty queue drawn as a row of words in memory, with Tail and Head pointers and an arrow marking higher address numbers.]

We begin with an empty queue ...

Thread 1 (T1), the “producer” thread, adds data to the tail of the queue.

Thread 2 (T2), the “consumer” thread, takes data from the head of the queue.


Producer adding x to the queue ...

Before: the queue is empty.

After: the queue holds x.

[Figure: both states drawn as words in memory with Tail and Head pointers; higher address numbers are marked.]

T1 code (producer):

    ORi  R1, R0, xval   ; Load x value into R1
    LW   R2, tail(R0)   ; Load tail pointer into R2
    SW   R1, 0(R2)      ; Store x into queue
    ADDi R2, R2, 4      ; Shift tail by one word
    SW   R2, tail(R0)   ; Update tail memory addr


Producer adding y to the queue ...

Before: the queue holds x.

After: the queue holds y and x.

[Figure: both states drawn as words in memory with Tail and Head pointers.]

T1 code (producer):

    ORi  R1, R0, yval   ; Load y value into R1
    LW   R2, tail(R0)   ; Load tail pointer into R2
    SW   R1, 0(R2)      ; Store y into queue
    ADDi R2, R2, 4      ; Shift tail by one word
    SW   R2, tail(R0)   ; Update tail memory addr


Consumer reading the queue ...

Before: the queue holds y and x.

After: the queue holds y.

[Figure: both states drawn as words in memory with Tail and Head pointers.]

T2 code (consumer):

          LW   R3, head(R0)   ; Load head pointer into R3
    spin: LW   R4, tail(R0)   ; Load tail pointer into R4
          BEQ  R4, R3, spin   ; If queue empty, wait
          LW   R5, 0(R3)      ; Read x from queue into R5
          ADDi R3, R3, 4      ; Shift head by one word
          SW   R3, head(R0)   ; Update head pointer
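The producer and consumer steps on the last three slides can be sketched in Python as a single-threaded model. This is only an illustration under stated assumptions: the class name `WordQueue`, the method names, and the base address `0x1000` are hypothetical, memory is modeled as a dictionary keyed by byte address, and the consumer returns `None` where the slide's code would spin.

```python
class WordQueue:
    """Queue of words in a flat memory, indexed by byte address."""

    WORD = 4  # bytes per word, matching the ADDi ..., 4 steps on the slides

    def __init__(self, base=0x1000):       # base address is an arbitrary assumption
        self.memory = {}                   # byte address -> stored word
        self.head = base
        self.tail = base                   # head == tail means the queue is empty

    def produce(self, value):              # T1 (producer)
        self.memory[self.tail] = value     # store the value at the tail
        self.tail += self.WORD             # shift tail by one word

    def try_consume(self):                 # T2 (consumer), without the spin loop
        if self.head == self.tail:         # queue empty
            return None                    # the slide's consumer would wait here
        value = self.memory[self.head]     # read the word at the head
        self.head += self.WORD             # shift head by one word
        return value


q = WordQueue()
q.produce("x")
q.produce("y")
first = q.try_consume()
second = q.try_consume()
```

Because the model runs in one thread, it shows only the pointer arithmetic; it cannot exhibit the ordering problems discussed next.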


What can go wrong?

[Figure: two queue snapshots labeled Before and After, one holding y and x and one holding only y, each with Tail and Head pointers; higher addresses are marked.]

T1 code (producer):

    ORi  R1, R0, x      ; Load x value into R1
    LW   R2, tail(R0)   ; Load tail pointer into R2
    SW   R1, 0(R2)      ; [1] Store x into queue
    ADDi R2, R2, 4      ; Shift tail by one word
    SW   R2, tail(R0)   ; [2] Update tail pointer

T2 code (consumer):

          LW   R3, head(R0)   ; Load head pointer into R3
    spin: LW   R4, tail(R0)   ; [3] Load tail pointer into R4
          BEQ  R4, R3, spin   ; If queue empty, wait
          LW   R5, 0(R3)      ; [4] Read x from queue into R5
          ADDi R3, R3, 4      ; Shift head by one word
          SW   R3, head(R0)   ; Update head pointer

What if the order is 2, 3, 4, 1? Then x is read before it is written! The CPU running T1 has no way to know it’s bad to delay 1!


Leslie Lamport: Sequential Consistency

Sequential Consistency: As if each thread takes turns executing, and instructions in each thread execute in program order.

Sequentially consistent architectures get the right answer, but give up many optimizations.

T1 code (producer):

    ORi  R1, R0, x      ; Load x value into R1
    LW   R2, tail(R0)   ; Load queue tail into R2
    SW   R1, 0(R2)      ; [1] Store x into queue
    ADDi R2, R2, 4      ; Shift tail by one word
    SW   R2, tail(R0)   ; [2] Update tail memory addr

T2 code (consumer):

          LW   R3, head(R0)   ; Load queue head into R3
    spin: LW   R4, tail(R0)   ; [3] Load queue tail into R4
          BEQ  R4, R3, spin   ; If queue empty, wait
          LW   R5, 0(R3)      ; [4] Read x from queue into R5
          ADDi R3, R3, 4      ; Shift head by one word
          SW   R3, head(R0)   ; Update head memory addr

Legal orders: 1, 2, 3, 4 or 1, 3, 2, 4 or 3, 4, 1, 2 ... but not 2, 3, 1, 4!
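The list of legal orders can be checked mechanically. The sketch below enumerates every interleaving of the four labeled operations and keeps those that preserve each thread's program order. It assumes that 1 and 2 are the producer's two stores (in that order) and that 3 and 4 are the consumer's two loads (in that order); the function names are hypothetical.

```python
from itertools import permutations

# Assumed labeling: T1 (producer) executes 1 then 2,
# T2 (consumer) executes 3 then 4.
PROGRAM_ORDER = {"T1": (1, 2), "T2": (3, 4)}


def respects_program_order(order):
    """True if each thread's operations appear in their program order."""
    for ops in PROGRAM_ORDER.values():
        positions = [order.index(op) for op in ops]
        if positions != sorted(positions):
            return False
    return True


def sequentially_consistent_orders():
    """All interleavings that a sequentially consistent machine may produce."""
    all_ops = [op for ops in PROGRAM_ORDER.values() for op in ops]
    return [p for p in permutations(all_ops) if respects_program_order(p)]


legal = sequentially_consistent_orders()
```

Under these assumptions six of the 24 permutations survive. The orders 2, 3, 1, 4 and 2, 3, 4, 1 are both rejected because they place 2 ahead of 1.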


Efficient alternative: Memory barriers

In the general case, the machine is not sequentially consistent. When needed, a memory barrier (a fence) may be added to the program.

All memory operations before the fence complete, then memory operations after the fence begin.

    ORi  R1, R0, x
    LW   R2, tail(R0)
    SW   R1, 0(R2)        ; [1]
    MEMBAR
    ADDi R2, R2, 4
    SW   R2, tail(R0)     ; [2]

Ensures 1 completes before 2 takes effect.

MEMBAR is expensive, but you only pay for it when you use it.

Many MEMBAR variations exist for efficiency (versions that only affect loads or stores, certain memory regions, etc.).
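To make the effect of a fence concrete, here is a toy Python model, not a description of any real machine. It assumes a CPU whose stores wait in a buffer and may reach memory in any order; the model drains the buffer newest-first as a deliberately adversarial choice. The names `ToyCPU`, `membar`, and `produce`, and the slot address 100, are all hypothetical.

```python
class ToyCPU:
    """Toy CPU whose buffered stores may become visible out of order."""

    def __init__(self, memory):
        self.memory = memory       # shared dict: address -> value
        self.pending = []          # buffered (address, value) stores
        self.visible_log = []      # order in which stores reached memory

    def store(self, addr, value):
        self.pending.append((addr, value))

    def _drain(self):
        while self.pending:
            addr, value = self.pending.pop()   # newest first: reorders stores
            self.memory[addr] = value
            self.visible_log.append(addr)

    def membar(self):
        # Fence: all earlier stores reach memory before any later store.
        self._drain()

    def finish(self):
        self._drain()


def produce(use_fence):
    memory = {"tail": 100, 100: None}   # one queue slot at hypothetical address 100
    cpu = ToyCPU(memory)
    tail = memory["tail"]
    cpu.store(tail, "x")                # 1: store x into the queue
    if use_fence:
        cpu.membar()
    cpu.store("tail", tail + 4)         # 2: update the tail pointer
    cpu.finish()
    return cpu.visible_log


without_fence = produce(use_fence=False)
with_fence = produce(use_fence=True)
```

In this model `without_fence` records the tail update becoming visible before x, which is the failure case from the earlier slide, while `with_fence` records x first.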


Producer/consumer memory fences

[Figure: two queue snapshots, one holding y and x and one holding only y, each with Tail and Head pointers; higher addresses are marked.]

T1 code (producer):

    ORi  R1, R0, x      ; Load x value into R1
    LW   R2, tail(R0)   ; Load queue tail into R2
    SW   R1, 0(R2)      ; [1] Store x into queue
    MEMBAR
    ADDi R2, R2, 4      ; Shift tail by one word
    SW   R2, tail(R0)   ; [2] Update tail memory addr

T2 code (consumer):

          LW   R3, head(R0)   ; Load queue head into R3
    spin: LW   R4, tail(R0)   ; [3] Load queue tail into R4
          BEQ  R4, R3, spin   ; If queue empty, wait
          MEMBAR
          LW   R5, 0(R3)      ; [4] Read x from queue into R5
          ADDi R3, R3, 4      ; Shift head by one word
          SW   R3, head(R0)   ; Update head memory addr

Ensures 1 happens before 2, and 3 happens before 4.


Reminder: Final Project Checkoff

[Figure, reused from CS 152 L8 (Pipelining I): block diagram with a pipelined CPU, instruction cache, data cache, DRAM controller, and DRAM, joined by the IC, IM, DC, and DM buses.]

TAs will provide “secret” MIPS machine code tests.

Bonus points if these tests run by 2 PM. If not, TAs give you test code to use over the weekend.


CS 152: What’s left ...

Monday 5/2: Final report due, 11:59 PM

Thursday 5/5: Midterm II, 6 PM to 9 PM, 320 Soda.

Tuesday 5/10: Final presentations.

Watch email for final project peer review request.

No class on Thursday. Review session on Tuesday 5/2, + HKN (???).

Deadline to bring up grading issues: Tues 5/10 @ 5 PM. Contact John at lazzaro@eecs


Sharing Write Access


One producer, two consumers ...

[Figure: two queue snapshots, one holding y and x and one holding only y, each with Tail and Head pointers; higher addresses are marked.]

T1 code (producer):

    ORi  R1, R0, x      ; Load x value into R1
    LW   R2, tail(R0)   ; Load queue tail into R2
    SW   R1, 0(R2)      ; Store x into queue
    ADDi R2, R2, 4      ; Shift tail by one word
    SW   R2, tail(R0)   ; Update tail memory addr

T2 & T3 (2 copies of consumer thread):

          LW   R3, head(R0)   ; Load queue head into R3
    spin: LW   R4, tail(R0)   ; Load queue tail into R4
          BEQ  R4, R3, spin   ; If queue empty, wait
          LW   R5, 0(R3)      ; Read x from queue into R5
          ADDi R3, R3, 4      ; Shift head by one word
          SW   R3, head(R0)   ; Update head memory addr

Critical section: T2 and T3 must take turns running red code.


Abstraction: Semaphores (Dijkstra, 1965)

Semaphore: unsigned int s

s is initialized to the number of threads permitted in the critical section at once (in our example, 1).

P(s): If s > 0, s-- and return. Otherwise, sleep. When woken, do s-- and return.

V(s): Do s++, awaken one sleeping process, return.

Example use (initial s = 1):

    P(s);
    critical section (s=0)
    V(s);

When awake, V(s) and P(s) are atomic: no interruptions, with exclusive access to s.
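The P and V operations can be sketched with Python's standard `threading.Condition`. This is a minimal illustration, not how an operating system implements semaphores. The class name, `run_demo`, and the bookkeeping dictionary are hypothetical. One deliberate difference from the slide: a woken thread rechecks s in a loop before decrementing, which guards against wakeups that arrive after another thread has already taken the slot.

```python
import threading


class Semaphore:
    """Counting semaphore in the P/V style."""

    def __init__(self, s):
        self._s = s
        self._cond = threading.Condition()

    def P(self):
        with self._cond:               # the lock makes P atomic with respect to V
            while self._s == 0:
                self._cond.wait()      # sleep until some V() wakes us
            self._s -= 1

    def V(self):
        with self._cond:
            self._s += 1
            self._cond.notify()        # awaken one sleeping thread


def run_demo(num_threads=4, rounds=200):
    sem = Semaphore(1)                 # one thread at a time in the critical section
    state = {"inside": 0, "max_inside": 0, "total": 0}

    def worker():
        for _ in range(rounds):
            sem.P()
            state["inside"] += 1       # critical section begins
            state["max_inside"] = max(state["max_inside"], state["inside"])
            state["total"] += 1
            state["inside"] -= 1       # critical section ends
            sem.V()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


state = run_demo()
```

With s initialized to 1, `state["max_inside"]` never exceeds 1, so no update to `state["total"]` is lost.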


Spin-Lock Semaphores: Test and Set

An example atomic read-modify-write ISA instruction:

    Test&Set(m, R)
        R = M[m];
        if (R == 0) then M[m] = 1;

    P:    Test&Set R6, mutex(R0)  ; Mutex check
          BNE  R6, R0, P          ; If not 0, spin

          ; Critical section
          LW   R3, head(R0)       ; Load queue head into R3
    spin: LW   R4, tail(R0)       ; Load queue tail into R4
          BEQ  R4, R3, spin       ; If queue empty, wait
          LW   R5, 0(R3)          ; Read x from queue into R5
          ADDi R3, R3, 4          ; Shift head by one word
          SW   R3, head(R0)       ; Update head memory addr

    V:    SW   R0, mutex(R0)      ; Give up mutex

Note: With Test&Set(), the M[m]=1 state corresponds to last slide’s s=0 state!

Assuming sequential consistency: 3 MEMBARs not shown ...

What if the OS swaps a process out while in the critical section? “High-latency locks”, a source of Linux audio problems (and others).
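The spin lock can be imitated in Python for illustration. Python has no Test&Set instruction, so the sketch below emulates its atomicity with an internal `threading.Lock`; that lock stands in for the hardware guarantee and is not part of the algorithm. The class and function names are hypothetical, and the critical section is reduced to the head-pointer update.

```python
import threading
import time


class TestAndSetCell:
    """One memory word M[m] with an emulated atomic Test&Set."""

    def __init__(self):
        self.value = 0                     # 0 = free, 1 = taken
        self._atomic = threading.Lock()    # stands in for hardware atomicity

    def test_and_set(self):
        with self._atomic:
            old = self.value               # R = M[m]
            if old == 0:
                self.value = 1             # if (R == 0) then M[m] = 1
            return old

    def clear(self):
        self.value = 0                     # V: store 0 to give up the mutex


def spin_lock_demo(num_threads=3, rounds=100):
    mutex = TestAndSetCell()
    shared = {"head": 0}

    def consumer():
        for _ in range(rounds):
            while mutex.test_and_set() != 0:   # P: spin until we read 0
                time.sleep(0)                  # yield so the lock holder can run
            head = shared["head"]              # critical section: load head,
            shared["head"] = head + 4          # shift by one word, store it back
            mutex.clear()                      # V

    threads = [threading.Thread(target=consumer) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["head"]


final_head = spin_lock_demo()
```

Each of the 300 passes through the critical section advances the head by one word, so the final value is 1200 when no update is lost.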


Non-blocking synchronization ...

Another atomic read-modify-write instruction:

    Compare&Swap(Rt, Rs, m)
        if (Rt == M[m])
        then M[m] = Rs; Rs = Rt; status = success;
        else status = fail;

    try:  LW   R3, head(R0)              ; Load queue head into R3
    spin: LW   R4, tail(R0)              ; Load queue tail into R4
          BEQ  R4, R3, spin              ; If queue empty, wait
          LW   R5, 0(R3)                 ; Read x from queue into R5
          ADDi R6, R3, 4                 ; Shift head by one word
          Compare&Swap R3, R6, head(R0)  ; Try to update head
          BNE  R3, R6, try               ; If not success, try again

If R3 != R6, another thread got here first, so we must try again.

If a thread swaps out before Compare&Swap, no latency problem; this code only “holds” the lock for one instruction!

Assuming sequential consistency: MEMBARs not shown ...
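The retry loop can be sketched in Python in the same spirit. As before, the atomicity of Compare&Swap is emulated with an internal `threading.Lock`, which stands in for the hardware. Several simplifications are assumptions of this sketch: the queue is filled in advance (no producer thread), the head counts items rather than byte addresses, consumers stop when the queue is drained instead of spinning, and the emulated instruction returns a success flag instead of writing a register. All names are hypothetical.

```python
import threading


class Word:
    """One memory word with an emulated atomic Compare&Swap."""

    def __init__(self, value):
        self.value = value
        self._atomic = threading.Lock()    # stands in for hardware atomicity

    def compare_and_swap(self, expected, new):
        with self._atomic:
            if self.value == expected:
                self.value = new
                return True                # status = success
            return False                   # status = fail


def cas_consumers_demo(num_threads=3, items=300):
    queue = list(range(items))             # slot i holds item i
    head = Word(0)
    taken = [[] for _ in range(num_threads)]

    def consumer(my_items):
        while True:
            old = head.value               # try: load head
            if old == items:               # drained; the slide's code would spin
                return
            item = queue[old]              # read the item at the head
            if head.compare_and_swap(old, old + 1):
                my_items.append(item)      # we won the race for this slot
            # otherwise another thread moved head first, so try again

    threads = [threading.Thread(target=consumer, args=(taken[i],))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(x for per_thread in taken for x in per_thread)


all_taken = cas_consumers_demo()
```

A successful Compare&Swap proves the head had not moved since it was loaded, so every item is claimed by exactly one consumer and no thread ever waits on a lock held by another.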


Semaphores with just LW & SW?

Can we implement semaphores with just normal loads and stores? Yes! Assuming sequential consistency ...

In practice, we create sequential consistency by using memory fence instructions ... so, not really “normal”.

Since load and store semaphore algorithms are quite tricky to get right, it is more convenient to use a Test&Set or Compare&Swap instead.
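The slide does not say which load/store algorithm it has in mind. One well-known example is Peterson's algorithm, which gives mutual exclusion for exactly two threads using only loads and stores, provided memory is sequentially consistent. The Python sketch below is an illustration under that assumption: it relies on the standard CPython interpreter executing these simple reads and writes one at a time and in order, and it would not be safe on hardware with a weaker memory model without fences. The function and variable names are hypothetical.

```python
import threading
import time


def peterson_demo(rounds=200):
    flag = [False, False]      # flag[i]: thread i wants to enter
    turn = [0]                 # which thread must wait if both want in
    shared = {"count": 0}

    def worker(me):
        other = 1 - me
        for _ in range(rounds):
            flag[me] = True                            # store: announce intent
            turn[0] = other                            # store: let the other go first
            while flag[other] and turn[0] == other:    # loads: wait for our turn
                time.sleep(0)                          # yield to the other thread
            value = shared["count"]                    # critical section
            shared["count"] = value + 1
            flag[me] = False                           # store: leave

    threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]


total = peterson_demo()
```

Even this short algorithm depends on the exact order of the two stores and two loads, which illustrates why an atomic read-modify-write instruction is the more convenient tool.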


Conclusions: Synchronization

MEMBAR: Memory fences, in lieu of full sequential consistency.

Test&Set: A spin-lock instruction for sharing write access.

Compare&Swap: A non-blocking alternative to share write access.
