
Single Cycle datapath

How to Design a Processor: step-by-step

• 1. Analyze instruction set => datapath requirements
– the meaning of each instruction is given by the register transfers
– datapath must include storage element for ISA registers
• possibly more
– datapath must support each register transfer
• 2. Select set of datapath components and establish clocking methodology
• 3. Assemble datapath meeting the requirements
• 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer.
• 5. Assemble the control logic

The MIPS Instruction Formats

• All MIPS instructions are 32 bits long. The three instruction formats:

R-type:  op (bits 31:26, 6 bits) | rs (25:21, 5 bits) | rt (20:16, 5 bits) | rd (15:11, 5 bits) | shamt (10:6, 5 bits) | funct (5:0, 6 bits)
I-type:  op (bits 31:26, 6 bits) | rs (25:21, 5 bits) | rt (20:16, 5 bits) | immediate (15:0, 16 bits)
J-type:  op (bits 31:26, 6 bits) | target address (25:0, 26 bits)

• The different fields are:
– op: operation of the instruction
– rs, rt, rd: the source and destination register specifiers
– shamt: shift amount
– funct: selects the variant of the operation in the “op” field
– address / immediate: address offset or immediate value
– target address: target address of the jump instruction
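The field boundaries above can be sketched as shifts and masks. This is a minimal illustration only; the names `Fields` and `decode` are hypothetical and do not come from the slides.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    op: int
    rs: int
    rt: int
    rd: int
    shamt: int
    funct: int
    imm16: int
    target: int


def decode(word: int) -> Fields:
    """Extract every field of the R-, I- and J-type formats from one word."""
    return Fields(
        op=(word >> 26) & 0x3F,      # bits 31:26
        rs=(word >> 21) & 0x1F,      # bits 25:21
        rt=(word >> 16) & 0x1F,      # bits 20:16
        rd=(word >> 11) & 0x1F,      # bits 15:11
        shamt=(word >> 6) & 0x1F,    # bits 10:6
        funct=word & 0x3F,           # bits 5:0
        imm16=word & 0xFFFF,         # bits 15:0 (I-type)
        target=word & 0x03FFFFFF,    # bits 25:0 (J-type)
    )


# R-type example word: op=0, rs=1, rt=2, rd=3, shamt=0, funct=10 0000
word = (1 << 21) | (2 << 16) | (3 << 11) | 0b100000
print(decode(word))
```

All fields are extracted unconditionally; which of them are meaningful depends on the format selected by op.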

Step 1a: The MIPS-lite Subset for today

• ADD and SUB (R-type format)
– addU rd, rs, rt
– subU rd, rs, rt
• OR Immediate (I-type format)
– ori rt, rs, imm16
• LOAD and STORE Word (I-type format)
– lw rt, rs, imm16
– sw rt, rs, imm16
• BRANCH (I-type format)
– beq rs, rt, imm16


Logical Register Transfers

• RTL gives the meaning of the instructions
• All start by fetching the instruction

op | rs | rt | rd | shamt | funct = MEM[ PC ]
op | rs | rt | Imm16 = MEM[ PC ]

inst    Register Transfers

ADDU    R[rd] <– R[rs] + R[rt]; PC <– PC + 4
SUBU    R[rd] <– R[rs] – R[rt]; PC <– PC + 4
ORi     R[rt] <– R[rs] or zero_ext(Imm16); PC <– PC + 4
LOAD    R[rt] <– MEM[ R[rs] + sign_ext(Imm16) ]; PC <– PC + 4
STORE   MEM[ R[rs] + sign_ext(Imm16) ] <– R[rt]; PC <– PC + 4
BEQ     if ( R[rs] == R[rt] ) then PC <– PC + 4 + ( sign_ext(Imm16) || 00 )
        else PC <– PC + 4
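The register transfers above can be read as a tiny interpreter. The following is a sketch under several assumptions: memory is a Python dict keyed by byte address, the op and funct encodings are the ones listed in the control summary table of these slides (add = 10 0000, sub = 10 0010), the branch target is taken relative to PC + 4, and register 0 is not forced to zero. The names `step`, `r_type` and `i_type` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_ext(imm16):
    """Interpret a 16-bit field as a signed integer."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def step(pc, regs, mem):
    """Apply the register transfers of the instruction at pc; return the next PC."""
    word = mem[pc]
    op, rs, rt = word >> 26, (word >> 21) & 31, (word >> 16) & 31
    rd, funct, imm16 = (word >> 11) & 31, word & 63, word & 0xFFFF
    next_pc = pc + 4
    if op == 0b000000 and funct == 0b100000:      # ADD
        regs[rd] = (regs[rs] + regs[rt]) & MASK32
    elif op == 0b000000 and funct == 0b100010:    # SUB
        regs[rd] = (regs[rs] - regs[rt]) & MASK32
    elif op == 0b001101:                          # ORi (zero-extended immediate)
        regs[rt] = regs[rs] | imm16
    elif op == 0b100011:                          # LOAD
        regs[rt] = mem[(regs[rs] + sign_ext(imm16)) & MASK32]
    elif op == 0b101011:                          # STORE
        mem[(regs[rs] + sign_ext(imm16)) & MASK32] = regs[rt]
    elif op == 0b000100:                          # BEQ
        if regs[rs] == regs[rt]:
            next_pc = (pc + 4 + (sign_ext(imm16) << 2)) & MASK32
    else:
        raise ValueError(f"instruction outside the subset: {word:#010x}")
    return next_pc


def r_type(rs, rt, rd, funct):
    return (rs << 21) | (rt << 16) | (rd << 11) | funct


def i_type(op, rs, rt, imm):
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


mem = {
    0: i_type(0b001101, 0, 1, 5),        # ori r1, r0, 5
    4: i_type(0b001101, 0, 2, 7),        # ori r2, r0, 7
    8: r_type(1, 2, 3, 0b100000),        # add r3, r1, r2
    12: i_type(0b101011, 0, 3, 0x100),   # sw  r3, 0x100(r0)
    16: i_type(0b100011, 0, 4, 0x100),   # lw  r4, 0x100(r0)
    20: i_type(0b000100, 3, 4, 2),       # beq r3, r4, 2
}
regs = [0] * 32
pc = 0
for _ in range(6):
    pc = step(pc, regs, mem)
print(regs[3], regs[4], mem[0x100], pc)
```

Every instruction ends by choosing the next PC, which is why the fetch unit is treated separately from the rest of the datapath in the slides that follow.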

Step 1: Requirements of the Instruction Set

• Memory
– instruction & data
• Registers (32 x 32)
– read RS
– read RT
– Write RT or RD
• PC
• Extender
• Add and Sub register or extended immediate
• Add 4 or extended immediate to PC

Step 2: Components of the Datapath

• Combinational Elements
• Storage Elements
– Clocking methodology

Combinational Logic Elements (Basic Building Blocks)

• Adder: 32-bit inputs A and B plus CarryIn; outputs 32-bit Sum and Carry
• MUX: 32-bit inputs A and B plus Select; output 32-bit Y
• ALU: 32-bit inputs A and B plus OP; output 32-bit Result

Storage Element: Register (Basic Building Block)

• Register
– Similar to the D Flip Flop except
• N-bit input and output
• Write Enable input
– Write Enable:
• negated (0): Data Out will not change
• asserted (1): Data Out will become Data In

[Figure: register with Clk, N-bit Data In, Write Enable and N-bit Data Out]

Storage Element: Register File

• Register File consists of 32 registers:
– Two 32-bit output busses: busA and busB
– One 32-bit input bus: busW
• Register is selected by:
– RA (number) selects the register to put on busA (data)
– RB (number) selects the register to put on busB (data)
– RW (number) selects the register to be written via busW (data) when Write Enable is 1
• Clock input (CLK)
– The CLK input is a factor ONLY during write operation
– During read operation, behaves as a combinational logic block:
• RA or RB valid => busA or busB valid after “access time.”

[Figure: 32 32-bit Registers with 5-bit RW, RA and RB inputs, 32-bit busW input, 32-bit busA and busB outputs, Clk and Write Enable]
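A behavioural sketch of this register file, assuming a software model in which reads are plain function calls (combinational) and a separate method stands in for the clock edge. The class and method names are hypothetical.

```python
class RegisterFile:
    """32 x 32-bit register file: combinational read, clocked write."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, ra, rb):
        """Return (busA, busB); no clock involved."""
        return self.regs[ra], self.regs[rb]

    def clock_edge(self, rw, bus_w, write_enable):
        """State changes only here, and only when Write Enable is 1."""
        if write_enable:
            self.regs[rw] = bus_w & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(rw=5, bus_w=0x1234, write_enable=1)
rf.clock_edge(rw=6, bus_w=0xFFFF, write_enable=0)   # ignored: enable is 0
print(rf.read(5, 6))
```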

Storage Element: Idealized Memory

• Memory (idealized)
– One input bus: Data In
– One output bus: Data Out
• Memory word is selected by:
– Address selects the word to put on Data Out
– Write Enable = 1: address selects the memory word to be written via the Data In bus
• Clock input (CLK)
– The CLK input is a factor ONLY during write operation
– During read operation, behaves as a combinational logic block:
• Address valid => Data Out valid after “access time.”

[Figure: memory with Address, 32-bit Data In, 32-bit Data Out, Clk and Write Enable]

Clocking Methodology

• All storage elements are clocked by the same clock edge

• Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew

• (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
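The two timing constraints can be checked numerically. The sketch below uses made-up delay values in picoseconds purely for illustration; the function names are hypothetical.

```python
def min_cycle_time(clk_to_q, longest_path, setup, skew):
    """Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew."""
    return clk_to_q + longest_path + setup + skew


def hold_ok(clk_to_q, shortest_path, skew, hold):
    """(CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time."""
    return clk_to_q + shortest_path - skew > hold


# Hypothetical delays in picoseconds.
print(min_cycle_time(clk_to_q=100, longest_path=1500, setup=80, skew=50))
print(hold_ok(clk_to_q=100, shortest_path=40, skew=50, hold=60))
print(hold_ok(clk_to_q=100, shortest_path=0, skew=50, hold=60))
```

Note that clock skew hurts both constraints: it is added to the cycle time and subtracted from the hold margin.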

[Figure: clock waveform with a Setup and Hold window around each clock edge; signals are Don’t Care outside those windows]

Step 3

• Register Transfer Requirements –> Datapath Assembly

• Instruction Fetch
• Read Operands and Execute Operation

3a: Overview of the Instruction Fetch Unit

• The common RTL operations
– Fetch the Instruction: mem[PC]
– Update the program counter:
• Sequential Code: PC <- PC + 4
• Branch and Jump: PC <- “something else”

[Figure: PC (clocked by Clk) drives the Address input of the Instruction Memory, which outputs the 32-bit Instruction Word; Next Address Logic computes the new PC]

3b: Add & Subtract

• R[rd] <- R[rs] op R[rt]    Example: addU rd, rs, rt
– Ra, Rb, and Rw come from instruction’s rs, rt, and rd fields
– ALUctr and RegWr: control logic after decoding the instruction

[Figure: 32 32-bit Registers with Rw = Rd, Ra = Rs, Rb = Rt (5 bits each); busA and busB feed the ALU, whose 32-bit Result returns on busW. Control inputs: RegWr, ALUctr, Clk]

R-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)


Register-Register Timing

[Figure: register file and ALU as on the previous slide, with a timing diagram. Within one clock cycle each signal changes from Old Value to New Value in this order:]

• PC: after Clk-to-Q
• Rs, Rt, Rd, Op, Func: after Instruction Memory Access Time
• ALUctr, RegWr: after Delay through Control Logic
• busA, busB: after Register File Access Time
• busW: after ALU Delay
• Register Write Occurs at the next clock edge

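3c: Logical Operations with Immediate

• R[rt] <- R[rs] op ZeroExt[imm16]

[Figure: the register file datapath with two additions. A mux controlled by RegDst selects Rt or Rd as Rw, and a mux controlled by ALUSrc selects busB or the output of ZeroExt (imm16 extended from 16 to 32 bits) as the second ALU input. Control inputs: RegDst, RegWr, ALUSrc, ALUctr]

I-type format: op (bits 31:26, 6 bits) | rs (25:21, 5 bits) | rt (20:16, 5 bits) | immediate (15:0, 16 bits). There is no rd field, so the destination register is rt.

ZeroExt[imm16]: bits 31:16 are sixteen 0s; bits 15:0 are the immediate.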

3d: Load Operations

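• R[rt] <- Mem[R[rs] + SignExt[imm16]]    Example: lw rt, rs, imm16

I-type format: op (bits 31:26, 6 bits) | rs (25:21, 5 bits) | rt (20:16, 5 bits) | immediate (15:0, 16 bits)

[Figure: the datapath of 3c with the ZeroExt block replaced by an Extender controlled by ExtOp, and a Data Memory (Adr, Data In, WrEn = MemWr, Clk) addressed by the ALU output. A mux controlled by W_Src selects the ALU result or the memory output for busW. Control inputs: RegDst, RegWr, ALUSrc, ExtOp, ALUctr, MemWr, W_Src]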
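The Extender can be sketched as a single function. This assumes the encoding used later in the control tables (ExtOp = 0 for “zero”, ExtOp = 1 for “sign”); the function name is hypothetical.

```python
def extend(imm16: int, ext_op: int) -> int:
    """Extend a 16-bit immediate to 32 bits.

    ext_op = 0: zero extension (used by ori)
    ext_op = 1: sign extension (used by lw and sw)
    """
    imm16 &= 0xFFFF
    if ext_op == 1 and imm16 & 0x8000:
        return imm16 | 0xFFFF0000   # replicate the sign bit into bits 31:16
    return imm16                    # bits 31:16 are zero


print(hex(extend(0x8000, 0)), hex(extend(0x8000, 1)), hex(extend(0x7FFF, 1)))
```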

3e: Store Operations

• Mem[ R[rs] + SignExt[imm16] ] <- R[rt]    Example: sw rt, rs, imm16

I-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | immediate (15:0)

[Figure: the load datapath with busB (R[rt]) also connected to the Data In port of the Data Memory. Control inputs: RegDst, RegWr, ALUSrc, ExtOp, ALUctr, MemWr, W_Src]

3f: The Branch Instruction

• beq rs, rt, imm16
– mem[PC]    Fetch the instruction from memory
– Equal <- R[rs] == R[rt]    Calculate the branch condition
– if (Equal)    Calculate the next instruction’s address
• PC <- PC + 4 + ( SignExt(imm16) x 4 )
– else
• PC <- PC + 4

I-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | immediate (15:0)


Datapath for Branch Operations

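• beq rs, rt, imm16    Datapath generates condition (equal)

I-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | immediate (15:0)

[Figure: the register file outputs busA (R[rs]) and busB (R[rt]) feed an Equal? comparator that produces Cond. In the fetch unit, one Adder computes PC + 4 and a second Adder adds the PC Ext output (imm16 extended, with 00 appended); a Mux controlled by nPC_sel and Cond selects the next PC, which supplies the Inst Address]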
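The next-address logic (two adders and a mux) can be sketched as follows. This is an assumption-laden model: `npc_sel` is treated as 1 for a branch instruction and 0 otherwise, and the function name is hypothetical.

```python
def next_pc(pc: int, imm16: int, npc_sel: int, equal: int) -> int:
    """Model of the fetch unit's next-address selection."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF                 # first adder
    offset = imm16 & 0xFFFF
    if offset & 0x8000:                               # PC Ext: sign extend
        offset -= 0x10000
    target = (pc_plus_4 + (offset << 2)) & 0xFFFFFFFF  # second adder, "|| 00"
    return target if (npc_sel and equal) else pc_plus_4  # mux


print(next_pc(20, 2, 1, 1), next_pc(20, 0xFFFF, 1, 1), next_pc(20, 2, 1, 0))
```

A branch offset of -1 (0xFFFF) lands back on the branch itself, because the offset is counted in words from PC + 4.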

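Putting it All Together: A Single Cycle Datapath

[Figure: complete single cycle datapath. The fetch unit (PC, Inst Memory, two Adders, PC Ext, Mux controlled by nPC_sel) produces Instruction<31:0>, which is split into Rs <21:25>, Rt <16:20>, Rd <11:15> and Imm16 <0:15>. The register file (Rw, Ra, Rb; busW, busA, busB) feeds the ALU and an = comparator that outputs Equal. Muxes controlled by RegDst, ALUSrc and MemtoReg select the destination register, the second ALU input and the write-back data. The Extender (ExtOp) widens imm16 to 32 bits, and the Data Memory (Adr, Data In, WrEn = MemWr) is addressed by the ALU. Control inputs: nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg]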

Step 4: Given Datapath: RTL -> Control

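[Figure: the Inst Memory (Adr) outputs Instruction<31:0>. Rs <21:25>, Rt <16:20>, Rd <11:15> and Imm16 <0:15> go to the DATA PATH; the Op and Fun fields go to Control. Control drives nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr and MemtoReg; the DATA PATH returns Equal]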

Meaning of the Control Signals

• ExtOp: “zero”, “sign”
• ALUsrc: 0 => regB; 1 => immed
• ALUctr: “add”, “sub”, “or”
• MemWr: write memory
• MemtoReg: 1 => Mem
• RegDst: 0 => “rt”; 1 => “rd”
• RegWr: write dest register

[Figure: the single cycle datapath with these control signals marked]

Example: Load Instruction

[Figure: the single cycle datapath annotated for a load: Extender = sign ext, ALU = add, destination register = rt, next PC = +4]

An Abstract View of the Implementation

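• Logical vs. Physical Structure

[Figure: Control and Datapath. The datapath contains the PC with Next Address logic, an Ideal Instruction Memory (Instruction Address in, Instruction out), 32 32-bit Registers (5-bit Rs, Rt, Rd; ports Rw, Ra, Rb), the ALU with 32-bit inputs A and B, and an Ideal Data Memory (Data Address, Data In, Data Out). Control receives the Instruction and Conditions from the datapath and sends Control Signals to it]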

Summary

• 5 steps to design a processor
– 1. Analyze instruction set => datapath requirements
– 2. Select set of datapath components & establish clock methodology
– 3. Assemble datapath meeting the requirements
– 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer.
– 5. Assemble the control logic
• MIPS makes it easier
– Instructions same size
– Source registers always in same place
– Immediates same size, location
– Operations always on registers/immediates
• Single cycle datapath => CPI=1, CCT => long
• Next time: implementing control

Recap: A Single Cycle Datapath

• We have everything except control signals (underline)
– Today’s lecture will show you how to generate the control signals

[Figure: the single cycle datapath with the Instruction Fetch Unit drawn as one block (inputs Clk, nPC_sel, Zero; output Instruction<31:0> split into Rs <21:25>, Rt <16:20>, Rd <11:15>, Imm16 <0:15>). Control signals: nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg]

RTL: The Add Instruction

• add rd, rs, rt

– mem[PC]    Fetch the instruction from memory
– R[rd] <- R[rs] + R[rt]    The actual operation
– PC <- PC + 4    Calculate the next instruction’s address

R-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)


Instruction Fetch Unit at the Beginning of Add

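• Fetch the instruction from Instruction memory: Instruction <- mem[PC]
– This is the same for all instructions

[Figure: instruction fetch unit. PC (Clk) drives Adr of the Inst Memory, which outputs Instruction<31:0>; two Adders (PC + 4, and PC + 4 plus the PC Ext of imm16 with 00 appended) feed a Mux controlled by nPC_sel]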

The Single Cycle Datapath during Add

• R[rd] <- R[rs] + R[rt]

R-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)

[Figure: the single cycle datapath with the control settings for add]

nPC_sel = +4, RegDst = 1, RegWr = 1, ALUSrc = 0, ExtOp = x, ALUctr = Add, MemWr = 0, MemtoReg = 0

Instruction Fetch Unit at the End of Add

• PC <- PC + 4
– This is the same for all instructions except: Branch and Jump

[Figure: instruction fetch unit with the Mux (nPC_sel) selecting the PC + 4 Adder output]

The Single Cycle Datapath during Or Immediate

• R[rt] <- R[rs] or ZeroExt[Imm16]

I-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | immediate (15:0)

[Figure: the single cycle datapath with the control settings left blank]

nPC_sel = , RegDst = , RegWr = , ALUSrc = , ExtOp = , ALUctr = , MemWr = , MemtoReg =

The Single Cycle Datapath during Load

• R[rt] <- Data Memory {R[rs] + SignExt[imm16]}

I-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | immediate (15:0)

[Figure: the single cycle datapath with the control settings for load]

nPC_sel = +4, RegDst = 0, RegWr = 1, ALUSrc = 1, ExtOp = 1, ALUctr = Add, MemWr = 0, MemtoReg = 1

The Single Cycle Datapath during Store

• Data Memory {R[rs] + SignExt[imm16]} <- R[rt]

I-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | immediate (15:0)

[Figure: the single cycle datapath with the control settings left blank]

nPC_sel = , RegDst = , RegWr = , ALUSrc = , ExtOp = , ALUctr = , MemWr = , MemtoReg =

The Single Cycle Datapath during Store

• Data Memory {R[rs] + SignExt[imm16]} <- R[rt]

I-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | immediate (15:0)

[Figure: the single cycle datapath with the control settings for store]

nPC_sel = +4, RegDst = x, RegWr = 0, ALUSrc = 1, ExtOp = 1, ALUctr = Add, MemWr = 1, MemtoReg = x

The Single Cycle Datapath during Branch

• if (R[rs] - R[rt] == 0) then Zero <- 1 ; else Zero <- 0

I-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | immediate (15:0)

[Figure: the single cycle datapath with the control settings for branch]

nPC_sel = “Br”, RegDst = x, RegWr = 0, ALUSrc = 0, ExtOp = x, ALUctr = Subtract, MemWr = 0, MemtoReg = x

Instruction Fetch Unit at the End of Branch

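• if (Zero == 1) then PC = PC + 4 + SignExt[imm16]*4 ; else PC = PC + 4

I-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | immediate (15:0)

[Figure: instruction fetch unit. The Mux controlled by nPC_sel selects between the PC + 4 Adder and the branch-target Adder (PC + 4 plus the extended imm16 with 00 appended)]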


A Summary of Control Signals

inst    Register Transfer

ADD     R[rd] <– R[rs] + R[rt]; PC <– PC + 4
        ALUsrc = RegB, ALUctr = “add”, RegDst = rd, RegWr, nPC_sel = “+4”

SUB     R[rd] <– R[rs] – R[rt]; PC <– PC + 4
        ALUsrc = RegB, ALUctr = “sub”, RegDst = rd, RegWr, nPC_sel = “+4”

ORi     R[rt] <– R[rs] or zero_ext(Imm16); PC <– PC + 4
        ALUsrc = Im, Extop = “Z”, ALUctr = “or”, RegDst = rt, RegWr, nPC_sel = “+4”

LOAD    R[rt] <– MEM[ R[rs] + sign_ext(Imm16) ]; PC <– PC + 4
        ALUsrc = Im, Extop = “Sn”, ALUctr = “add”, MemtoReg, RegDst = rt, RegWr, nPC_sel = “+4”

STORE   MEM[ R[rs] + sign_ext(Imm16) ] <– R[rt]; PC <– PC + 4
        ALUsrc = Im, Extop = “Sn”, ALUctr = “add”, MemWr, nPC_sel = “+4”

BEQ     if ( R[rs] == R[rt] ) then PC <– PC + 4 + ( sign_ext(Imm16) || 00 ) else PC <– PC + 4
        nPC_sel = “Br”, ALUctr = “sub”

A Summary of the Control Signals

              add       sub       ori       lw        sw        beq       jump
format        R-type    R-type    I-type    I-type    I-type    I-type    J-type
func          10 0000   10 0010   We Don’t Care :-)
op            00 0000   00 0000   00 1101   10 0011   10 1011   00 0100   00 0010
RegDst        1         1         0         0         x         x         x
ALUSrc        0         0         1         1         1         0         x
MemtoReg      0         0         0         1         x         x         x
RegWrite      1         1         1         1         0         0         0
MemWrite      0         0         0         0         1         0         0
nPCsel        0         0         0         0         0         1         0
Jump          0         0         0         0         0         0         1
ExtOp         x         x         0         1         1         x         x
ALUctr<2:0>   Add       Subtract  Or        Add       Add       Subtract  xxx

See Appendix A for the op and func encodings.

R-type (add, sub):           op | rs | rt | rd | shamt | funct
I-type (ori, lw, sw, beq):   op | rs | rt | immediate
J-type (jump):               op | target address
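The control summary can be captured as a lookup table. The sketch below transcribes the rows of the table above; `None` stands for a don’t-care (x), and the names `TABLE` and `main_control` are hypothetical.

```python
X = None  # don't care

SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemWrite",
           "nPCsel", "Jump", "ExtOp", "ALUctr")

TABLE = {
    "add":  (1, 0, 0, 1, 0, 0, 0, X, "Add"),
    "sub":  (1, 0, 0, 1, 0, 0, 0, X, "Subtract"),
    "ori":  (0, 1, 0, 1, 0, 0, 0, 0, "Or"),
    "lw":   (0, 1, 1, 1, 0, 0, 0, 1, "Add"),
    "sw":   (X, 1, X, 0, 1, 0, 0, 1, "Add"),
    "beq":  (X, 0, X, 0, 0, 1, 0, X, "Subtract"),
    "jump": (X, X, X, 0, 0, 0, 1, X, None),
}

OP = {0b001101: "ori", 0b100011: "lw", 0b101011: "sw",
      0b000100: "beq", 0b000010: "jump"}
FUNC = {0b100000: "add", 0b100010: "sub"}   # used only when op = 00 0000


def main_control(op: int, func: int = 0) -> dict:
    """Return the control signal settings for one instruction."""
    name = FUNC[func] if op == 0 else OP[op]
    return dict(zip(SIGNALS, TABLE[name]))


print(main_control(0b100011))            # lw
print(main_control(0, 0b100010))         # sub
```

The write enables (RegWrite, MemWrite) never appear as don’t-cares: a wrong value there would change architectural state.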

The Concept of Local Decoding

              R-type     ori       lw        sw        beq       jump
op            00 0000    00 1101   10 0011   10 1011   00 0100   00 0010
RegDst        1          0         0         x         x         x
ALUSrc        0          1         1         1         0         x
MemtoReg      0          0         1         x         x         x
RegWrite      1          1         1         0         0         0
MemWrite      0          0         0         1         0         0
Branch        0          0         0         0         1         0
Jump          0          0         0         0         0         1
ExtOp         x          0         1         1         x         x
ALUop<N:0>    “R-type”   Or        Add       Add       Subtract  xxx

[Figure: Main Control decodes op (6 bits) into ALUop (N bits); ALU Control (Local) combines ALUop with func (6 bits) to produce ALUctr (3 bits) for the ALU]

The Encoding of ALUop

• In this exercise, ALUop has to be 2 bits wide to represent:
– (1) “R-type” instructions
– “I-type” instructions that require the ALU to perform:
• (2) Or, (3) Add, and (4) Subtract
• To implement the full MIPS ISA, ALUop has to be 3 bits to represent:
– (1) “R-type” instructions
– “I-type” instructions that require the ALU to perform:
• (2) Or, (3) Add, (4) Subtract, and (5) And (Example: andi)

[Figure: Main Control (op, 6 bits) produces ALUop (N bits); ALU Control (Local) takes ALUop and func (6 bits) and produces ALUctr (3 bits)]

                   R-type     ori     lw      sw      beq        jump
ALUop (Symbolic)   “R-type”   Or      Add     Add     Subtract   xxx
ALUop<2:0>         1 00       0 10    0 00    0 00    0 01       xxx

The Decoding of the “func” Field


                   R-type     ori     lw      sw      beq        jump
ALUop (Symbolic)   “R-type”   Or      Add     Add     Subtract   xxx
ALUop<2:0>         1 00       0 10    0 00    0 00    0 01       xxx

[Figure: Main Control (op, 6 bits) produces ALUop (N bits); ALU Control (Local) takes ALUop and func (6 bits) and produces ALUctr (3 bits) for the ALU]

R-type format: op (bits 31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)

funct<5:0>   Instruction Operation
10 0000      add
10 0010      subtract
10 0100      and
10 0101      or
10 1010      set-on-less-than

ALUctr<2:0>  ALU Operation
000          And
001          Or
010          Add
110          Subtract
111          Set-on-less-than

Recall ALU Homework (also P. 286 text)

The Truth Table for ALUctr

                   R-type     ori     lw      sw      beq
ALUop (Symbolic)   “R-type”   Or      Add     Add     Subtract
ALUop<2:0>         1 00       0 10    0 00    0 00    0 01

funct<3:0>   Instruction Op.
0000         add
0010         subtract
0100         and
0101         or
1010         set-on-less-than

ALUop                    func                             ALU          ALUctr
bit<2> bit<1> bit<0>     bit<3> bit<2> bit<1> bit<0>      Operation    bit<2> bit<1> bit<0>
0      0      0          x      x      x      x           Add          0      1      0
0      x      1          x      x      x      x           Subtract     1      1      0
0      1      x          x      x      x      x           Or           0      0      1
1      x      x          0      0      0      0           Add          0      1      0
1      x      x          0      0      1      0           Subtract     1      1      0
1      x      x          0      1      0      0           And          0      0      0
1      x      x          0      1      0      1           Or           0      0      1
1      x      x          1      0      1      0           Set on <     1      1      1

The Logic Equation for ALUctr<2>

ALUop                    func
bit<2> bit<1> bit<0>     bit<3> bit<2> bit<1> bit<0>      ALUctr<2>
0      x      1          x      x      x      x           1
1      x      x          0      0      1      0           1
1      x      x          1      0      1      0           1

• ALUctr<2> = !ALUop<2> & ALUop<0> + ALUop<2> & !func<2> & func<1> & !func<0>

This makes func<3> a don’t care

The Logic Equation for ALUctr<1>

ALUop                    func
bit<2> bit<1> bit<0>     bit<3> bit<2> bit<1> bit<0>      ALUctr<1>
0      0      0          x      x      x      x           1
0      x      1          x      x      x      x           1
1      x      x          0      0      0      0           1
1      x      x          0      0      1      0           1
1      x      x          1      0      1      0           1

• ALUctr<1> = !ALUop<2> & !ALUop<1> + ALUop<2> & !func<2> & !func<0>

The Logic Equation for ALUctr<0>

ALUop                    func
bit<2> bit<1> bit<0>     bit<3> bit<2> bit<1> bit<0>      ALUctr<0>
0      1      x          x      x      x      x           1
1      x      x          0      1      0      1           1
1      x      x          1      0      1      0           1

• ALUctr<0> = !ALUop<2> & ALUop<1>
            + ALUop<2> & !func<3> & func<2> & !func<1> & func<0>
            + ALUop<2> & func<3> & !func<2> & func<1> & !func<0>

The ALU Control Block

[Figure: ALU Control (Local) with inputs ALUop (3 bits) and func (6 bits) and output ALUctr (3 bits)]

• ALUctr<2> = !ALUop<2> & ALUop<0> + ALUop<2> & !func<2> & func<1> & !func<0>
• ALUctr<1> = !ALUop<2> & !ALUop<1> + ALUop<2> & !func<2> & !func<0>
• ALUctr<0> = !ALUop<2> & ALUop<1>
            + ALUop<2> & !func<3> & func<2> & !func<1> & func<0>
            + ALUop<2> & func<3> & !func<2> & func<1> & !func<0>
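The ALU control equations can be checked against the truth table by brute force. The sketch below is one reading of the equations: the non-R-type terms of ALUctr<1> and ALUctr<0> test ALUop<1>, which is what the truth-table rows “0 1 x => Or (001)” and “0 x 1 => Subtract (110)” require with the encodings ori = 010 and beq = 001. The function name is hypothetical.

```python
def alu_ctr(aluop: int, func: int) -> int:
    """Compute ALUctr<2:0> from ALUop<2:0> and func<3:0>."""
    a2, a1, a0 = (aluop >> 2) & 1, (aluop >> 1) & 1, aluop & 1
    f3, f2, f1, f0 = (func >> 3) & 1, (func >> 2) & 1, (func >> 1) & 1, func & 1

    def n(bit):          # logical NOT of a single bit
        return bit ^ 1

    c2 = (n(a2) & a0) | (a2 & n(f2) & f1 & n(f0))
    c1 = (n(a2) & n(a1)) | (a2 & n(f2) & n(f0))
    c0 = ((n(a2) & a1)
          | (a2 & n(f3) & f2 & n(f1) & f0)
          | (a2 & f3 & n(f2) & f1 & n(f0)))
    return (c2 << 2) | (c1 << 1) | c0


ADD, SUB, AND, OR, SLT = 0b010, 0b110, 0b000, 0b001, 0b111

# (ALUop, func<3:0>) -> expected ALUctr, one entry per truth-table row
EXPECTED = {
    (0b000, 0b0000): ADD,   # lw / sw
    (0b001, 0b0000): SUB,   # beq
    (0b010, 0b0000): OR,    # ori
    (0b100, 0b0000): ADD,   # R-type add
    (0b100, 0b0010): SUB,   # R-type subtract
    (0b100, 0b0100): AND,   # R-type and
    (0b100, 0b0101): OR,    # R-type or
    (0b100, 0b1010): SLT,   # R-type set-on-less-than
}

for (aluop, func), want in EXPECTED.items():
    print(f"{aluop:03b} {func:04b} -> {alu_ctr(aluop, func):03b} (want {want:03b})")
```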

The “Truth Table” for the Main Control


                   R-type     ori       lw        sw        beq       jump
op                 00 0000    00 1101   10 0011   10 1011   00 0100   00 0010
RegDst             1          0         0         x         x         x
ALUSrc             0          1         1         1         0         x
MemtoReg           0          0         1         x         x         x
RegWrite           1          1         1         0         0         0
MemWrite           0          0         0         1         0         0
Branch             0          0         0         0         1         0
Jump               0          0         0         0         0         1
ExtOp              x          0         1         1         x         x
ALUop (Symbolic)   “R-type”   Or        Add       Add       Subtract  xxx
ALUop <2>          1          0         0         0         0         x
ALUop <1>          0          1         0         0         0         x
ALUop <0>          0          0         0         0         1         x

[Figure: Main Control decodes op (6 bits) into RegDst, ALUSrc, ... and ALUop (3 bits); ALU Control (Local) combines ALUop with func (6 bits) to produce ALUctr (3 bits)]

Putting it All Together: A Single Cycle Processor

[Figure: the single cycle datapath together with its control. Main Control takes op = Instr<31:26> (6 bits) and drives nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, MemWr, MemtoReg and ALUop (3 bits). ALU Control takes ALUop and func = Instr<5:0> (6 bits) and drives ALUctr (3 bits). Instr<15:0> supplies Imm16; Rs <21:25>, Rt <16:20> and Rd <11:15> supply the register specifiers; the ALU returns Zero to the Instruction Fetch Unit]

Worst Case Timing (Load)

[Timing diagram: within one clock cycle each signal changes from Old Value to New Value in this order:]

• PC: after Clk-to-Q
• Rs, Rt, Rd, Op, Func: after Instruction Memory Access Time
• ALUctr, RegWr, ExtOp, ALUSrc, MemtoReg: after Delay through Control Logic
• busA: after Register File Access Time
• busB (second ALU input): after Delay through Extender & Mux
• Address: after ALU Delay
• busW: after Data Memory Access Time
• Register Write Occurs at the next clock edge

Drawback of this Single Cycle Processor

• Long cycle time:
– Cycle time must be long enough for the load instruction:
    PC’s Clock-to-Q +
    Instruction Memory Access Time +
    Register File Access Time +
    ALU Delay (address calculation) +
    Data Memory Access Time +
    Register File Setup Time +
    Clock Skew
• Cycle time for load is much longer than needed for all other instructions
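To make the drawback concrete, the load path can be summed and compared with the register-register path from the earlier timing diagram. All delay values below are invented for illustration (picoseconds); only the list of path components comes from the slides.

```python
# Hypothetical component delays in picoseconds.
DELAYS_PS = {
    "clk_to_q": 100,
    "inst_mem": 2000,
    "reg_file": 1000,
    "alu": 1500,
    "data_mem": 2000,
    "setup": 100,
    "skew": 50,
}

LOAD_PATH = ("clk_to_q", "inst_mem", "reg_file", "alu", "data_mem", "setup", "skew")
RTYPE_PATH = ("clk_to_q", "inst_mem", "reg_file", "alu", "setup", "skew")


def path_delay(path, delays=DELAYS_PS):
    """Sum the delays of the components along one path."""
    return sum(delays[stage] for stage in path)


cycle = path_delay(LOAD_PATH)          # the clock must fit the slowest instruction
slack = cycle - path_delay(RTYPE_PATH)
print(cycle, slack)
```

With these numbers every register-register instruction wastes the data memory access time in each cycle, which is the motivation for the multicycle design.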

Summary

• Single cycle datapath => CPI=1, CCT => long
• 5 steps to design a processor
– 1. Analyze instruction set => datapath requirements
– 2. Select set of datapath components & establish clock methodology
– 3. Assemble datapath meeting the requirements
– 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer.
– 5. Assemble the control logic
• Control is the hard part
• MIPS makes control easier
– Instructions same size
– Source registers always in same place
– Immediates same size, location
– Operations always on registers/immediates

[Figure: Processor (Control and Datapath), Memory, Input, Output]

Multicycle Datapath

Partitioning the CPI=1 Datapath

• Add registers between smallest steps

[Figure: the single cycle datapath divided into blocks: PC, Next PC, Instruction Fetch, Operand Fetch, Exec, Mem Access (Data Mem), Reg. File and Result Store. Control signals shown: nPC_sel, ALUctr, RegDst, ALUSrc, ExtOp, MemRd, MemWr, RegWr]

Example Multicycle Datapath

• Critical Path ?

[Figure: multicycle datapath with blocks PC, Next PC, Instruction Fetch, Operand Fetch, Ext, ALU, Mem Access (Data Mem), Reg. File and Result Store, separated by the registers IR, A, B, R and M. Signals shown: nPC_sel, ALUctr, RegDst, ALUSrc, ExtOp, MemRd, MemWr, MemToReg, RegWr, Equal]

Recall: Step-by-step Processor Design

Step 1: ISA => Logical Register Transfers

Step 2: Components of the Datapath

Step 3: RTL + Components => Datapath

Step 4: Datapath + Logical RTs => Physical RTs

Step 5: Physical RTs => Control

Step 4: R-type (add, sub, . . .)

• Logical Register Transfer

inst    Logical Register Transfers
ADDU    R[rd] <– R[rs] + R[rt]; PC <– PC + 4

• Physical Register Transfers

inst    Physical Register Transfers
        IR <– MEM[pc]
ADDU    A <– R[rs]; B <– R[rt]
        S <– A + B
        R[rd] <– S; PC <– PC + 4

[Figure: multicycle datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access (Data Mem), M, Reg. File and the Equal signal]

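Step 4: Logical immed

• Logical Register Transfer

inst    Logical Register Transfers
ORI     R[rt] <– R[rs] OR zx(Im16); PC <– PC + 4

• Physical Register Transfers

inst    Physical Register Transfers
        IR <– MEM[pc]
ORI     A <– R[rs]; B <– R[rt]
        S <– A or ZeroExt(Im16)
        R[rt] <– S; PC <– PC + 4

[Figure: multicycle datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access (Data Mem), M, Reg. File and the Equal signal]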

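Step 4 : Load

• Logical Register Transfer

inst    Logical Register Transfers
LW      R[rt] <– MEM( R[rs] + sx(Im16) );
        PC <– PC + 4

• Physical Register Transfers

inst    Physical Register Transfers
        IR <– MEM[pc]
LW      A <– R[rs]; B <– R[rt]
        S <– A + SignEx(Im16)
        M <– MEM[S]
        R[rt] <– M; PC <– PC + 4

[Figure: multicycle datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access (Data Mem), M, Reg. File and the Equal signal]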

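Step 4 : Store

• Logical Register Transfer

inst    Logical Register Transfers
SW      MEM( R[rs] + sx(Im16) ) <– R[rt];
        PC <– PC + 4

• Physical Register Transfers

inst    Physical Register Transfers
        IR <– MEM[pc]
SW      A <– R[rs]; B <– R[rt]
        S <– A + SignEx(Im16)
        MEM[S] <– B; PC <– PC + 4

[Figure: multicycle datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access (Data Mem), M, Reg. File and the Equal signal]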

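Step 4 : Branch

• Logical Register Transfer

inst    Logical Register Transfers
BEQ     if R[rs] == R[rt]
        then PC <= PC + sx(Im16) || 00
        else PC <= PC + 4

• Physical Register Transfers

inst         Physical Register Transfers
             IR <– MEM[pc]
BEQ & ~Eq    PC <– PC + 4

             IR <– MEM[pc]
BEQ & Eq     PC <– PC + sx(Im16) || 00

[Figure: multicycle datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access (Data Mem), M, Reg. File and the Equal signal]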

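Alternative datapath (book): Multiple Cycle Datapath

• Minimizes Hardware: 1 memory, 1 adder

[Figure: multiple cycle datapath built around one Ideal Memory (RAdr, WrAdr, Din, Dout, MemWr) and one ALU. A Mux controlled by IorD selects the memory address from PC or ALU Out. Dout loads the Instruction Reg (IRWr). The Reg File (Ra = Rs, Rb = Rt, Rw selected from Rt or Rd by RegDst; busW selected by MemtoReg; RegWr) feeds the ALU through the ALUSelA mux (PC or busA) and the ALUSelB mux (busB, 4, extended Imm, or extended Imm << 2). Extend (ExtOp) widens the 16-bit Imm to 32 bits. ALU Control is driven by ALUOp. The ALU result can be written to the Target register (BrWr) or, through the PCSrc mux, to the PC (PCWr, PCWrCond with Zero)]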

Our Control Model

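• State specifies control points for Register Transfer
• Transfer occurs upon exiting state (same falling edge)

[Figure: a Control State register feeds Next State Logic and Output Logic; inputs (conditions) enter the Next State Logic and outputs (control points) leave the Output Logic. Each State X is labelled with its Register Transfer and Control Points, and its successor Depends on Input]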

Step 4 => Control Specification for multicycle proc

[State diagram, listed by stage:]

• “instruction fetch”: IR <= MEM[PC]
• “decode / operand fetch”: A <= R[rs]; B <= R[rt]
• R-type
– Execute: S <= A fun B
– Write-back: R[rd] <= S; PC <= PC + 4
• ORi
– Execute: S <= A or ZX
– Write-back: R[rt] <= S; PC <= PC + 4
• LW
– Execute: S <= A + SX
– Memory: M <= MEM[S]
– Write-back: R[rt] <= M; PC <= PC + 4
• SW
– Execute: S <= A + SX
– Memory: MEM[S] <= B; PC <= PC + 4
• BEQ & ~Equal: PC <= PC + 4
• BEQ & Equal: PC <= PC + SX || 00

Step 5: datapath + state diagram => control

• Translate RTs into control points
• Assign states
• Then go build the controller

Mapping RTs to Control Points

[State diagram of the control specification, with control points attached to each register transfer. The ones shown are:]

• IR <= MEM[PC]: imem_rd, IRen
• “decode”: Aen, Ben
• S <= A fun B: ALUfun, Sen
• R[rd] <= S; PC <= PC + 4: RegDst, RegWr, PCen

Assigning States

[State diagram of the control specification, with a 4-bit code assigned to each state:]

• 0000: IR <= MEM[PC]
• 0001: “decode”
• R-type: 0100: S <= A fun B; 0101: R[rd] <= S, PC <= PC + 4
• ORi: 0110: S <= A or ZX; 0111: R[rt] <= S, PC <= PC + 4
• LW: 1000: S <= A + SX; 1001: M <= MEM[S]; 1010: R[rt] <= M, PC <= PC + 4
• SW: 1011: S <= A + SX; 1100: MEM[S] <= B, PC <= PC + 4
• BEQ: 0011: PC <= PC + 4; 0010: PC <= PC + SX || 00

Detailed Control Specification

Control point columns, in order: IR en | PC en, sel | Ops A, B | Exec Ex, Sr, ALU, S | Mem R, W, M | Write-Back M-R, Wr, Dst

State   Op field   Eq   Next    Control points set
0000    ??????     ?    0001    1
0001    BEQ        0    0011    1 1
0001    BEQ        1    0010    1 1
0001    R-type     x    0100    1 1
0001    orI        x    0110    1 1
0001    LW         x    1000    1 1
0001    SW         x    1011    1 1
0010    xxxxxx     x    0000    1 1
0011    xxxxxx     x    0000    1 0
0100    xxxxxx     x    0101    0 1 fun 1          (R:)
0101    xxxxxx     x    0000    1 0 0 1 1
0110    xxxxxx     x    0111    0 0 or 1           (ORi:)
0111    xxxxxx     x    0000    1 0 0 1 0
1000    xxxxxx     x    1001    1 0 add 1          (LW:)
1001    xxxxxx     x    1010    1 0 0
1010    xxxxxx     x    0000    1 0 1 1 0
1011    xxxxxx     x    1100    1 0 add 1          (SW:)
1100    xxxxxx     x    0000    1 0 0 1

– the rows for state 0001 are all the same in a Moore machine

Controller Design

• The state diagrams that define the controller for an instruction set processor are highly structured
• Use this structure to construct a simple “microsequencer”
• Control reduces to programming this very simple device
– microprogramming

[Figure: a micro-PC sequencer selects the microinstruction, which supplies sequencer control and datapath control]

Example: Jump-Counter

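[Figure: the op-code indexes a Map ROM whose output can be loaded into a Counter. The counter has three operations: zero (next state 0000), inc (state i goes to i+1) and load (next state taken from the Map ROM)]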

Using a Jump Counter

[State diagram with state codes as assigned earlier, each state annotated with the counter operation that leaves it:]

• inc: 0000, 0100, 0110, 1000, 1001, 1011, and 0001 when the branch is taken
• load: 0001 (dispatch through the Map ROM)
• zero: 0010, 0011, 0101, 0111, 1010, 1100

Our Microsequencer

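[Figure: the op-code indexes the Map ROM, which feeds the Micro-PC. The microinstruction supplies the Z, I, L (zero, inc, load) sequencing controls and the datapath control; “taken” is an input from the datapath]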

Microprogram Control Specification

Control point columns, in order: IR en | PC en, sel | Ops A, B | Exec Ex, Sr, ALU, S | Mem R, W, M | Write-Back M-R, Wr, Dst

µPC     Taken   Next    Control points set
0000    ?       inc     1
0001    0       load
0001    1       inc
0010    x       zero    1 1                (BEQ)
0011    x       zero    1 0
0100    x       inc     0 1 fun 1          (R:)
0101    x       zero    1 0 0 1 1
0110    x       inc     0 0 or 1           (ORi:)
0111    x       zero    1 0 0 1 0
1000    x       inc     1 0 add 1          (LW:)
1001    x       inc     1 0 0
1010    x       zero    1 0 1 1 0
1011    x       inc     1 0 add 1          (SW:)
1100    x       zero    1 0 0 1

Mapping ROM

R-type 000000 0100

BEQ 000100 0011

ori 001101 0110

LW 100011 1000

SW 101011 1011
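The microsequencer can be simulated directly from the Next column of the microprogram and the Mapping ROM. In this sketch `taken` is assumed to be asserted only for a BEQ whose operands are equal; the names `next_upc` and `trace` are hypothetical.

```python
# Mapping ROM: opcode -> first micro-PC value of the instruction's sequence
MAP_ROM = {
    0b000000: 0b0100,   # R-type
    0b000100: 0b0011,   # BEQ
    0b001101: 0b0110,   # ori
    0b100011: 0b1000,   # LW
    0b101011: 0b1011,   # SW
}

# Sequencing field of each microinstruction (state 0001 depends on "taken")
SEQ = {
    0b0000: "inc",
    0b0010: "zero", 0b0011: "zero",
    0b0100: "inc", 0b0101: "zero",
    0b0110: "inc", 0b0111: "zero",
    0b1000: "inc", 0b1001: "inc", 0b1010: "zero",
    0b1011: "inc", 0b1100: "zero",
}


def next_upc(upc, opcode, taken=False):
    """Apply zero / inc / load to the micro-PC."""
    if upc == 0b0001:
        action = "inc" if taken else "load"
    else:
        action = SEQ[upc]
    if action == "inc":
        return upc + 1
    if action == "load":
        return MAP_ROM[opcode]
    return 0b0000


def trace(opcode, taken=False):
    """List the micro-PC values visited by one instruction."""
    upc, visited = 0b0000, []
    while True:
        visited.append(upc)
        upc = next_upc(upc, opcode, taken)
        if upc == 0b0000:
            return visited


for name, op in [("R-type", 0b000000), ("ori", 0b001101),
                 ("LW", 0b100011), ("SW", 0b101011)]:
    print(name, [f"{s:04b}" for s in trace(op)])
print("BEQ taken", [f"{s:04b}" for s in trace(0b000100, taken=True)])
print("BEQ not taken", [f"{s:04b}" for s in trace(0b000100)])
```

The length of each trace is the number of cycles the instruction takes in this multicycle design.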

Overview of Control

• Control may be designed using one of several initial representations. The choice of sequence control, and how logic is represented, can then be determined independently; the control can then be implemented with one of several methods using a structured logic technique.

Initial Representation      Finite State Diagram            Microprogram
Sequencing Control          Explicit Next State Function    Microprogram counter + Dispatch ROMs
Logic Representation        Logic Equations                 Truth Tables
Implementation Technique    PLA                             ROM
                            “hardwired control”             “microprogrammed control”

Summary

• Disadvantages of the Single Cycle Processor

–Long cycle time

–Cycle time is too long for all instructions except the Load

• Multiple Cycle Processor:

–Divide the instructions into smaller steps

–Execute each step (instead of the entire instruction) in one cycle

• Partition datapath into equal size chunks to minimize cycle time

– ~10 levels of logic between latches

• Follow same 5-step method for designing “real” processor

Summary (cont’d)

• Control is specified by finite state diagram
• Specialized state diagrams are easily captured by a microsequencer
– simple increment & “branch” fields
– datapath control fields
• Control design reduces to Microprogramming
• Control is more complicated with:
– complex instruction sets
– restricted datapaths (see the book)
• Simple Instruction set and powerful datapath => simple control
– could try to reduce hardware (see the book)
– rather go for speed => many instructions at once!

Our Controller FSM Spec

[State diagram, listed by stage:]

• 0000: IR <= MEM[PC]; PC <= PC + 4
• 0001: “decode”
• R-type: 0100: S <= A fun B; 0101: R[rd] <= S
• ORi: 0110: S <= A op ZX; 0111: R[rt] <= S
• LW: 1000: S <= A + SX; 1001: M <= MEM[S]; 1010: R[rt] <= M
• SW: 1011: S <= A + SX; 1100: MEM[S] <= B
• BEQ (states 0010 and 0011): S <= A - B; if Equal, PC <= PC + SX || 00; if ~Equal, return to fetch

Microprogramming

• Control is the hard part of processor design
– Datapath is fairly regular and well-organized
– Memory is highly regular
– Control is irregular and global

Microprogramming:
-- A Particular Strategy for Implementing the Control Unit of a processor by "programming" at the level of register transfer operations

Microarchitecture:
-- Logical structure and functional capabilities of the hardware as seen by the microprogrammer

Historical Note:
IBM 360 Series first to distinguish between architecture & organization. Same instruction set across wide range of implementations, each with different cost/performance

Sequencer-based control unit

[Figure: Control Logic takes Inputs (the Opcode and the State Reg) and produces Outputs to the Multicycle Datapath; an Adder (+1) and Address Select Logic compute the next state]

Types of “branching”
• Set state to 0
• Dispatch (state 1)
• Use incremented state number

Designing a Microinstruction Set

1) Start with list of control signals

2) Group signals together that make sense (vs. random): called “fields”

3) Place fields in some logical order (e.g., ALU operation & ALU operands first and microinstruction sequencing last)

4) Create a symbolic legend for the microinstruction format, showing name of field values and how they set the control signals

–Use computers to design computers

5) To minimize the width, encode operations that will never be used at the same time

1&2) Start with list of control signals, grouped into fields

Single Bit Control

Signal name    Effect when deasserted           Effect when asserted
ALUSelA        1st ALU operand = PC             1st ALU operand = Reg[rs]
RegWrite       None                             Reg. is written
MemtoReg       Reg. write data input = ALU      Reg. write data input = memory
RegDst         Reg. dest. no. = rt              Reg. dest. no. = rd
TargetWrite    None                             Target reg. = ALU
MemRead        None                             Memory at address is read
MemWrite       None                             Memory at address is written
IorD           Memory address = PC              Memory address = ALU
IRWrite        None                             IR = Memory
PCWrite        None                             PC = PCSource
PCWriteCond    None                             IF ALUzero then PC = PCSource

Multiple Bit Control

Signal name    Value    Effect
ALUOp          00       ALU adds
               01       ALU subtracts
               10       ALU does function code
               11       ALU does logical OR
ALUSelB        000      2nd ALU input = Reg[rt]
               001      2nd ALU input = 4
               010      2nd ALU input = sign extended IR[15-0]
               011      2nd ALU input = sign extended, shift left 2 IR[15-0]
               100      2nd ALU input = zero extended IR[15-0]
PCSource       00       PC = ALU
               01       PC = Target
               10       PC = PC+4[29-26] : IR[25–0] << 2

Start with list of control signals, cont’d

• For next state function (next microinstruction address), use Sequencer-based control unit from last lecture

– Called “microPC” or “µPC” vs. state register

Signal: Sequencing

Value | Effect
00 | Next address = 0
01 | Next address = dispatch ROM
10 | Next address = address + 1

[Figure: microsequencer. Labels: Opcode, ROM, µAddress Select Logic, Mux (inputs 0, 1, 2), microPC, Adder (+1).]
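The three sequencing choices can be sketched in a few lines of Python. This is only an illustration of the selection logic: the dispatch ROM contents and the opcode names are made up, and only the 00/01/10 encodings come from the slide.

```python
# Hypothetical dispatch ROM: opcode name -> first microaddress of its routine.
# These addresses are illustrative assumptions, not values from the slides.
DISPATCH_ROM = {"lw": 2, "sw": 5, "rtype": 7}


def next_micro_pc(micro_pc: int, sequencing: int, opcode: str) -> int:
    """Pick the next microinstruction address from the Sequencing field."""
    if sequencing == 0b00:          # go back to the first microinstruction
        return 0
    if sequencing == 0b01:          # dispatch on the opcode through the ROM
        return DISPATCH_ROM[opcode]
    if sequencing == 0b10:          # fall through to the next microinstruction
        return micro_pc + 1
    raise ValueError("sequencing value 11 is not defined in the table")


if __name__ == "__main__":
    print(next_micro_pc(3, 0b10, "lw"))   # sequential: 4
    print(next_micro_pc(1, 0b01, "sw"))   # dispatch: 5
```

The adder, the ROM, and the constant 0 correspond to the three mux inputs; the Sequencing field plays the role of the mux select.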

3) Microinstruction Format: unencoded vs. encoded fields

Field Name | Width (wide) | Width (narrow) | Control Signals Set
ALU Control | 4 | 2 | ALUOp
SRC1 | 2 | 1 | ALUSelA
SRC2 | 5 | 3 | ALUSelB
ALU Destination | 6 | 4 | RegWrite, MemtoReg, RegDst, TargetWr.
Memory | 4 | 3 | MemRead, MemWrite, IorD
Memory Register | 1 | 1 | IRWrite
PCWrite Control | 5 | 4 | PCWrite, PCWriteCond, PCSource
Sequencing | 3 | 2 | AddrCtl
Total width (bits) | 30 | 20 |

4) Legend of Fields and Symbolic Names

Field Name | Value for Field | Function of Field with Specific Value
ALU | Add | ALU adds
ALU | Subt. | ALU subtracts
ALU | Func code | ALU does function code
ALU | Or | ALU does logical OR
SRC1 | PC | 1st ALU input = PC
SRC1 | rs | 1st ALU input = Reg[rs]
SRC2 | 4 | 2nd ALU input = 4
SRC2 | Extend | 2nd ALU input = sign ext. IR[15-0]
SRC2 | Extend0 | 2nd ALU input = zero ext. IR[15-0]
SRC2 | Extshft | 2nd ALU input = sign ex., sl IR[15-0]
SRC2 | rt | 2nd ALU input = Reg[rt]
ALU destination | Target | Target = ALUout
ALU destination | rd | Reg[rd] = ALUout
Memory | Read PC | Read memory using PC
Memory | Read ALU | Read memory using ALU output
Memory | Write ALU | Write memory using ALU output
Memory register | IR | IR = Mem
Memory register | Write rt | Reg[rt] = Mem
Memory register | Read rt | Mem = Reg[rt]
PC write | ALU | PC = ALU output
PC write | Target-cond. | IF ALU Zero then PC = Target
PC write | jump addr. | PC = PCSource
Sequencing | Seq | Go to sequential µinstruction
Sequencing | Fetch | Go to the first microinstruction
Sequencing | Dispatch | Dispatch using ROM.

Microprogramming Pros and Cons

• Ease of design

• Flexibility

– Easy to adapt to changes in organization, timing, technology

– Can make changes late in design cycle, or even in the field

• Can implement very powerful instruction sets (just more control memory)

• Generality

– Can implement multiple instruction sets on same machine.

– Can tailor instruction set to application.

• Compatibility

– Many organizations, same instruction set

• Costly to implement

• Slow

Exceptions

• Exception = unprogrammed control transfer

– system takes action to handle the exception

• must record the address of the offending instruction

– returns control to user

– must save & restore user state

• Allows construction of a “user virtual machine”

[Figure: the user program follows normal control flow (sequential, jumps, branches, calls, returns). On an exception, control passes to the System Exception Handler, which later returns from the exception.]

What happens to Instruction with Exception?

• MIPS architecture defines the instruction as having no effect if the instruction causes an exception.

• When we get to virtual memory we will see that certain classes of exceptions must prevent the instruction from changing the machine state.

• This aspect of handling exceptions becomes complex and potentially limits performance, which is why it is hard

Two Types of Exceptions

• Interrupts

– caused by external events

– asynchronous to program execution

– may be handled between instructions

– simply suspend and resume user program

• Traps

– caused by internal events

• exceptional conditions (overflow)

• errors (parity)

• faults (non-resident page)

– synchronous to program execution

– condition must be remedied by the handler

– instruction may be retried or simulated and program continued, or program may be aborted

MIPS convention:

• exception means any unexpected change in control flow, without distinguishing internal or external; use the term interrupt only when the event is externally caused.

Type of event | From where? | MIPS terminology
I/O device request | External | Interrupt
Invoke OS from user program | Internal | Exception
Arithmetic overflow | Internal | Exception
Using an undefined instruction | Internal | Exception
Hardware malfunctions | Either | Exception or Interrupt

Additions to MIPS ISA to support Exceptions?

• EPC – a 32-bit register used to hold the address of the affected instruction (register 14 of coprocessor 0).

• Cause – a register used to record the cause of the exception. In the MIPS architecture this register is 32 bits, though some bits are currently unused. Assume that bits 5 to 2 of this register encode the two possible exception sources mentioned above: undefined instruction=0 and arithmetic overflow=1 (register 13 of coprocessor 0).

• BadVAddr – register containing the memory address at which the offending memory reference occurred (register 8 of coprocessor 0)

• Status – interrupt mask and enable bits (register 12 of coprocessor 0)

• Control signals to write EPC, Cause, BadVAddr, and Status

• Be able to write the exception address into PC: widen the PC mux to add as an input 10000000 00000000 00000000 10000000two (8000 0080hex)

• May have to undo PC = PC + 4, since we want EPC to point to the offending instruction (not its successor); PC = PC - 4
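As a rough illustration of the register transfers listed above, the following Python sketch computes the values written on an exception. The function name and the dictionary result are hypothetical; the cause codes (0 and 1 in bits 5 to 2) follow the assumption stated in the Cause bullet, and the vector is the 8000 0080hex address given above.

```python
EXCEPTION_VECTOR = 0x80000080   # address loaded into PC on an exception

# Cause codes as assumed on this slide (placed in bits 5..2 of Cause).
UNDEFINED_INSTRUCTION = 0
ARITHMETIC_OVERFLOW = 1


def take_exception(pc_after_fetch: int, source: int) -> dict:
    """Return the EPC, Cause and PC values written when an exception is taken.

    pc_after_fetch is the PC after the fetch step already did PC = PC + 4.
    """
    epc = (pc_after_fetch - 4) & 0xFFFFFFFF   # undo PC + 4: point at the offender
    cause = (source & 0xF) << 2               # source code goes in bits 5..2
    return {"EPC": epc, "Cause": cause, "PC": EXCEPTION_VECTOR}


if __name__ == "__main__":
    regs = take_exception(0x00400024, ARITHMETIC_OVERFLOW)
    print({name: hex(value) for name, value in regs.items()})
```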

How Control Detects Exceptions in our FSD

• Undefined Instruction – detected when no next state is defined from state 1 for the op value.

– We handle this exception by defining the next state value for all op values other than lw, sw, 0 (R-type), jmp, beq, and ori as new state 12.

– Shown symbolically using “other” to indicate that the op field does not match any of the opcodes that label arcs out of state 1.

• Arithmetic overflow – Chapter 4 included logic in the ALU to detect overflow, and a signal called Overflow is provided as an output from the ALU. This signal is used in the modified finite state machine to specify an additional possible next state

• Note: Challenge in designing control of a real machine is to handle different interactions between instructions and other exception-causing events such that control logic remains small and fast.

– Complex interactions make the control unit the most challenging aspect of hardware design

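Modification to the Control Specification

[Figure: finite state diagram extended with two exception states; state numbers 0010 and 0011 label states in the figure. Register transfers recoverable from the figure:]

• Fetch: IR <= MEM[PC]; PC <= PC + 4

• R-type: S <= A fun B; R[rd] <= S

• ORi: S <= A op ZX; R[rt] <= S

• LW: S <= A + SX; M <= MEM[S]; R[rt] <= M

• SW: S <= A + SX; MEM[S] <= B

• BEQ: S <= A - B; on Equal, PC <= PC + SX || 00; on ~Equal, no transfer

• other (undefined instruction): EPC <= PC - 4; PC <= exp_addr; cause <= 10 (RI)

• overflow (additional condition from Datapath): EPC <= PC - 4; PC <= exp_addr; cause <= 12 (Ovf)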

Summary

• Specialized state diagrams are easily captured by a microsequencer

– simple increment & “branch” fields

– datapath control fields

• Control design reduces to Microprogramming

• Exceptions are the hard part of control

• Need to find convenient place to detect exceptions and to branch to state or microinstruction that saves PC and invokes the operating system

• It gets even harder with pipelined CPUs that support page faults on memory accesses: the instruction cannot complete AND you must be able to restart the program at exactly the instruction with the exception

Pipelining

Pipelining is Natural!

• Laundry Example

• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold

• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes

Sequential Laundry

• Sequential laundry takes 6 hours for 4 loads

• If they learned pipelining, how long would laundry take?

[Figure: task order (A, B, C, D) against time, 6 PM to midnight. The four loads run one after another, each taking 30 + 40 + 20 minutes.]

Pipelined Laundry: Start work ASAP

• Pipelined laundry takes 3.5 hours for 4 loads

[Figure: task order (A, B, C, D) against time, starting at 6 PM. The loads overlap; the timeline is 30, 40, 40, 40, 40, 20 minutes.]
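The 6 hour and 3.5 hour figures can be checked with a short Python sketch. It assumes each machine handles one load at a time and that a finished load may wait in front of a busy machine; the function names are illustrative.

```python
STAGE_MINUTES = [30, 40, 20]   # washer, dryer, "folder"


def sequential_minutes(loads: int, stages=STAGE_MINUTES) -> int:
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads: int, stages=STAGE_MINUTES) -> int:
    """Start each load as soon as the stage it needs is free."""
    free_at = [0] * len(stages)    # time at which each stage becomes free
    finish = 0
    for _ in range(loads):
        t = 0                      # time this load is ready for its next stage
        for i, duration in enumerate(stages):
            start = max(t, free_at[i])
            t = start + duration
            free_at[i] = t
        finish = t
    return finish


if __name__ == "__main__":
    print(sequential_minutes(4) / 60)   # 6.0 hours
    print(pipelined_minutes(4) / 60)    # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: the 40 minute dryer, the slowest stage, sets the rate.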

Pipelining Lessons

• Pipelining doesn’t help latency of single task, it helps throughput of entire workload

• Pipeline rate limited by slowest pipeline stage

• Multiple tasks operating simultaneously using different resources

• Potential speedup = Number pipe stages

• Unbalanced lengths of pipe stages reduce speedup

• Time to “fill” pipeline and time to “drain” it reduce speedup

• Stall for Dependences

[Figure: pipelined laundry timeline for loads A to D, 6 PM to 9 PM, with stage times 30, 40, 40, 40, 40, 20 minutes.]

Pipelined Execution

• Utilization?

• Now we just have to make it work

[Figure: program flow against time. Successive instructions each pass through IFetch, Dcd, Exec, Mem, WB, each starting one cycle after the previous one.]

Single Cycle, Multiple Cycle, vs. Pipeline

[Figure: clock timing comparison over Cycles 1 to 10.

Single Cycle Implementation: Load and Store each take one long cycle (Cycle 1, Cycle 2); part of the Store cycle is wasted.

Multiple Cycle Implementation: Load takes Ifetch, Reg, Exec, Mem, Wr; Store takes Ifetch, Reg, Exec, Mem; then R-type begins with Ifetch.

Pipeline Implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, overlapped in successive cycles.]

Why Pipeline?

• Suppose we execute 100 instructions

• Single Cycle Machine

– 45 ns/cycle x 1 CPI x 100 inst = 4500 ns

• Multicycle Machine

– 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns

• Ideal pipelined machine

– 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
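The same arithmetic as a small Python sketch, using the cycle times and CPI values quoted above; the helper name is illustrative.

```python
def run_time_ns(cycle_ns: int, cycles: int) -> int:
    """Total execution time = cycle time x number of cycles."""
    return cycle_ns * cycles


INSTRUCTIONS = 100

single_cycle = run_time_ns(45, INSTRUCTIONS * 1)            # 1 CPI, long cycle
multicycle = run_time_ns(10, round(INSTRUCTIONS * 4.6))     # 4.6 CPI from the mix
pipelined = run_time_ns(10, INSTRUCTIONS * 1 + 4)           # 1 CPI + 4 cycle drain

if __name__ == "__main__":
    print(single_cycle, multicycle, pipelined)               # 4500 4600 1040
    print(f"speedup over single cycle: {single_cycle / pipelined:.2f}")
```

With only 100 instructions the 4 drain cycles are visible; for longer runs the pipelined time approaches 10 ns per instruction.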

Why Pipeline? Because the resources are there!

[Figure: instruction order (Inst 0 to Inst 4) against time in clock cycles. Each instruction passes through Im, Reg, ALU, Dm, Reg, offset by one cycle from the previous one.]

Can pipelining get us into trouble?

• Yes: Pipeline Hazards

– structural hazards: attempt to use the same resource two different ways at the same time

• E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV)

– data hazards: attempt to use item before it is ready

• E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer

• instruction depends on result of prior instruction still in the pipeline

– control hazards: attempt to make a decision before condition is evaluated

• E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in

• branch instructions

• Can always resolve hazards by waiting

– pipeline control must detect the hazard

– take action (or delay action) to resolve hazards

Summary 1/3

• Specialized state diagrams are easily captured by a microsequencer

– simple increment & “branch” fields

– datapath control fields

• Control design reduces to Microprogramming

• Exceptions are the hard part of control

• Need to find convenient place to detect exceptions and to branch to state or microinstruction that saves PC and invokes the operating system

• It gets even harder with pipelined CPUs that support page faults on memory accesses: the instruction cannot complete AND you must be able to restart the program at exactly the instruction with the exception

Summary 2/3

• Microprogramming is a fundamental concept

– implement an instruction set by building a very simple processor and interpreting the instructions

– essential for very complex instructions and when few register transfers are possible

• Pipelining is a fundamental concept

– multiple steps using distinct resources

• Utilize capabilities of the Datapath by pipelined instruction processing

–start next instruction while working on the current one

– limited by length of longest stage (plus fill/flush)

–detect and resolve hazards

The Five Stages of Load

• Ifetch: Instruction Fetch

– Fetch the instruction from the Instruction Memory

• Reg/Dec: Registers Fetch and Instruction Decode

• Exec: Calculate the memory address

• Mem: Read the data from the Data Memory

• Wr: Write the data back to the register file

[Figure: Load occupies Cycle 1 to Cycle 5: Ifetch, Reg/Dec, Exec, Mem, Wr.]

Pipelining

• Improve performance by increasing instruction throughput

Ideal speedup is number of stages in the pipeline. Do we achieve this?

Basic Idea

• What do we need to add to actually split the datapath into stages?

Graphically Representing Pipelines

• Can help with answering questions like:

– how many cycles does it take to execute this code?

– what is the ALU doing during cycle 4?

– use this representation to help understand datapaths

Conventional Pipelined Execution Representation

[Figure: program flow against time. Successive instructions each pass through IFetch, Dcd, Exec, Mem, WB, each starting one cycle after the previous one.]


Single Memory is a Structural Hazard

[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) against time in clock cycles. Each instruction passes through Mem, Reg, ALU, Mem, Reg, so the Load's data access and a later instruction's fetch need the single memory in the same cycle.]

Detection is easy in this case! (right half highlight means read, left half write)

• Stall: wait until decision is clear

– It’s possible to move up decision to 2nd stage by adding hardware to check registers as being read

• Impact: 2 clock cycles per branch instruction => slow

Control Hazard Solutions

[Figure: instruction order (Add, Beq, Load) against time in clock cycles. Each instruction passes through Mem, Reg, ALU, Mem, Reg; the Load after the Beq is delayed.]

• Predict: guess one direction then back up if wrong

– Predict not taken

• Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ≈ 50% of time)

• More dynamic scheme: history of 1 branch (≈ 90%)
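The impact figures above combine into an expected cost per branch. The Python sketch below assumes the 1 cycle / 2 cycle costs and the 50% and 90% accuracies quoted on the slide; the function name is illustrative.

```python
def expected_branch_cycles(p_correct: float,
                           right_cost: int = 1,
                           wrong_cost: int = 2) -> float:
    """Average clock cycles per branch for a given prediction accuracy."""
    return p_correct * right_cost + (1 - p_correct) * wrong_cost


if __name__ == "__main__":
    # Always stalling costs 2 cycles per branch, for comparison.
    print(f"predict not taken (50% right): {expected_branch_cycles(0.5):.2f}")
    print(f"1-branch history (90% right):  {expected_branch_cycles(0.9):.2f}")
```

This prints 1.50 and 1.10 cycles per branch, against a flat 2 cycles when every branch stalls.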


[Figure: instruction order (Add, Beq, Load) against time in clock cycles, each instruction passing through Mem, Reg, ALU, Mem, Reg.]

• Redefine branch behavior (takes place after next instruction) “delayed branch”

• Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” (≈ 50% of time)

• As launch more instructions per clock cycle, less useful


[Figure: instruction order (Add, Beq, Misc, Load) against time in clock cycles, each instruction passing through Mem, Reg, ALU, Mem, Reg; Misc fills the slot after the Beq.]

Data Hazard on r1

add r1 ,r2,r3

sub r4, r1 ,r3

and r6, r1 ,r7

or r8, r1 ,r9

xor r10, r1 ,r11

• Dependencies backwards in time are hazards

Data Hazard on r1:

[Figure: instruction order against time in clock cycles for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11. Stages are IF, ID/RF, EX, MEM, WB, drawn as Im, Reg, ALU, Dm, Reg.]

• “Forward” result from one stage to another

• “or” OK if define read/write properly

Data Hazard Solution:

[Figure: the same five instructions (add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11) against time in clock cycles, with stages IF, ID/RF, EX, MEM, WB drawn as Im, Reg, ALU, Dm, Reg.]

• Dependencies backwards in time are hazards

• Can’t solve with forwarding:

• Must delay/stall instruction dependent on loads

Forwarding (or Bypassing): What about Loads

[Figure: lw r1,0(r2) followed by sub r4,r1,r3 against time in clock cycles, with stages IF, ID/RF, EX, MEM, WB drawn as Im, Reg, ALU, Dm, Reg.]
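A minimal Python sketch of the decision described above, under the usual assumption of a five-stage pipeline with full forwarding: an ALU result can be forwarded to the very next instruction, but a loaded value is not available until after the Mem stage, so a dependent instruction right behind a load needs one bubble. The `Instr` type and `stall_cycles` helper are hypothetical names for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                    # e.g. "add", "sub", "lw"
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read


def stall_cycles(producer: Instr, consumer: Instr) -> int:
    """Bubbles needed between two back-to-back instructions with forwarding."""
    if producer.dest is None or producer.dest not in consumer.srcs:
        return 0               # no dependence on the previous instruction
    # ALU results forward from EX; load data only exists after Mem.
    return 1 if producer.op == "lw" else 0


if __name__ == "__main__":
    add = Instr("add", "r1", ("r2", "r3"))
    lw = Instr("lw", "r1", ("r2",))
    sub = Instr("sub", "r4", ("r1", "r3"))
    print(stall_cycles(add, sub))   # 0: forwarding is enough
    print(stall_cycles(lw, sub))    # 1: load-use needs a stall
```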

Designing a Pipelined Processor

• Go back and examine your datapath and control diagram

• associate resources with states

• ensure that flows do not conflict, or figure out how to resolve

• assert control in appropriate stage

Pipelined Processor (almost) for slides

• What happens if we start a new instruction every cycle?

[Figure: pipelined datapath. Labels: Next PC, PC, Inst. Mem, IR with a Valid bit; Dcd Ctrl, Reg. File, A, B; IRex, Ex Ctrl, Exec, Equal, S; IRmem, Mem Ctrl, Mem Access, Data Mem, M; IRwb, WB Ctrl, Reg File.]

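Control and Datapath

[Figure: register transfers drawn beside the datapath (Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, Equal, S, Mem Access, Data Mem, D, M, Reg File). Register transfers recoverable from the figure:]

• Fetch: IR <– Mem[PC]; PC <– PC+4

• Decode: A <– R[rs]; B <– R[rt]

• R-type: S <– A + B; R[rd] <– S

• lw: S <– A + SX; M <– Mem[S]; R[rt] <– M

• ori: S <– A or ZX; R[rt] <– S

• sw: S <– A + SX; Mem[S] <– B

• beq: if Cond, PC <– PC+SX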

Pipelining the Load Instruction

• The five independent functional units in the pipeline datapath are:

–Instruction Memory for the Ifetch stage

–Register File’s Read ports (bus A and busB) for the Reg/Dec stage

–ALU for the Exec stage

–Data Memory for the Mem stage

–Register File’s Write port (bus W) for the Wr stage

[Figure: clock timing over Cycles 1 to 7 for three back-to-back loads. The 1st lw, 2nd lw, and 3rd lw each pass through Ifetch, Reg/Dec, Exec, Mem, Wr, each starting one cycle after the previous one.]

The Four Stages of R-type


• Ifetch: Fetch the instruction from the Instruction Memory

• Reg/Dec: Registers Fetch and Instruction Decode

• Exec:

– ALU operates on the two register operands

– Update PC

• Wr: Write the ALU output back to the register file

[Figure: R-type occupies Cycle 1 to Cycle 4: Ifetch, Reg/Dec, Exec, Wr.]

Pipelining the R-type and Load Instruction

• We have a pipeline conflict or structural hazard:

– Two instructions try to write to the register file at the same time!

– Only one write port

[Figure: clock timing for a mix of R-type and Load instructions.]

Oops! We have a problem!

Important Observation

• Each functional unit can only be used once per instruction

• Each functional unit must be used at the same stage for all instructions:

– Load uses Register File’s Write Port during its 5th stage (stages 1 2 3 4 5)

– R-type uses Register File’s Write Port during its 4th stage (stages 1 2 3 4)

2 ways to solve this pipeline hazard.
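The conflict can be reproduced with a small Python sketch. It assumes one instruction is issued per cycle starting at cycle 1, with the stage lists given on the Load and R-type slides; the names `STAGES`, `write_cycle`, and `write_port_conflicts` are illustrative.

```python
from collections import Counter

# Stage sequences from the slides: Load has 5 stages, R-type has 4.
STAGES = {
    "lw": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
    "rtype": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
}


def write_cycle(kind: str, issue_cycle: int, stages=STAGES) -> int:
    """Clock cycle in which the instruction uses the register write port."""
    return issue_cycle + stages[kind].index("Wr")


def write_port_conflicts(program, stages=STAGES):
    """Cycles in which more than one instruction wants the single write port."""
    uses = Counter(write_cycle(kind, cycle, stages)
                   for cycle, kind in enumerate(program, start=1))
    return sorted(cycle for cycle, count in uses.items() if count > 1)


if __name__ == "__main__":
    # A load followed directly by an R-type: both reach Wr in cycle 5.
    print(write_port_conflicts(["lw", "rtype"]))          # [5]

    # Giving R-type a (no-op) Mem stage as well removes the conflict.
    delayed = dict(STAGES, rtype=STAGES["lw"])
    print(write_port_conflicts(["lw", "rtype"], delayed)) # []
```

The second call previews the "delay R-type's write" fix: once every instruction writes in its 5th stage, two instructions issued in different cycles can never collide on the write port.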

Solution 1: Insert “Bubble” into the Pipeline

• Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle

–The control logic can be complex.

–Lose instruction fetch and issue opportunity.

• No instruction is started in Cycle 6!

[Figure: clock timing with a bubble inserted in the pipeline, so that an R-type (Ifetch, Reg/Dec, Exec, Wr) does not write in the same cycle as the preceding Load.]

Solution 2: Delay R-type’s Write by One Cycle

• Delay R-type’s register write by one cycle:

– Now R-type instructions also use Reg File’s write port at Stage 5

– Mem stage is a NOOP stage: nothing is being done.

[Figure: clock timing. Every R-type now passes through five stages (1 2 3 4 5): Ifetch, Reg/Dec, Exec, Mem, Wr.]

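Modified Control & Datapath

[Figure: register transfers drawn beside the datapath (Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, Equal, S, Mem Access, Data Mem, D, M, Reg File). Register transfers recoverable from the figure:]

• Fetch: IR <– Mem[PC]; PC <– PC+4

• Decode: A <– R[rs]; B <– R[rt]

• R-type: S <– A + B; M <– S; R[rd] <– M

• lw: S <– A + SX; M <– Mem[S]; R[rt] <– M

• ori: S <– A or ZX; M <– S; R[rt] <– M

• sw: S <– A + SX; Mem[S] <– B

• beq: if Cond, PC <– PC+SX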

The Four Stages of Store


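• Ifetch: Fetch the instruction from the Instruction Memory

• Reg/Dec: Registers Fetch and Instruction Decode

• Exec: Calculate the memory address

• Mem: Write the data into the Data Memory

[Figure: Store passes through Ifetch, Reg/Dec, Exec, Mem; the Wr stage is unused.]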

The Three Stages of Beq


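• Ifetch: Fetch the instruction from the Instruction Memory

• Reg/Dec:

– Registers Fetch and Instruction Decode

• Exec:

– compares the two register operands,

– select correct branch target address

– latch into PC

[Figure: Beq passes through Ifetch, Reg/Dec, Exec; the Mem and Wr stages are unused.]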

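Control Diagram

[Figure: register transfers drawn beside the datapath (Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, Equal, S, Mem Access, Data Mem, D, M, Reg File). Register transfers recoverable from the figure:]

• Fetch: IR <– Mem[PC]; PC <– PC+4

• Decode: A <– R[rs]; B <– R[rt]

• R-type: S <– A + B; M <– S; R[rd] <– S

• lw: S <– A + SX; M <– Mem[S]; R[rt] <– M

• ori: S <– A or ZX; M <– S; R[rt] <– S

• sw: S <– A + SX; Mem[S] <– B

• beq: if Cond, PC <– PC+SX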

Datapath + Data Stationary Control

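[Figure: pipelined datapath with data-stationary control. Datapath labels: Next PC, PC, Inst. Mem, IR, Decode, Reg. File, A, B, Exec, S, Mem Access, Data Mem, D, M, Mem Ctrl, WB Ctrl, Reg File. Decode extracts op, rs, rt, fun, and im; the control bundle travels with the instruction and shrinks stage by stage: “ex me wb rw v”, then “me wb rw v”, then “wb rw v”.]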

Let’s Try it Out

10 lw r1, r2(35)

14 addI r2, r2, 3

20 sub r3, r4, r5

24 beq r6, r7, 100

30 ori r8, r9, 17

34 add r10, r11, r12

100 and r13, r14, r15

these addresses are octal

Start: Fetch 10

[Figure: datapath snapshot with the program listing repeated beside it. PC = 10; IR fields are still symbolic (rs, rt, im); the four downstream control slots are marked n.]

Fetch 14, Decode 10

[Figure: datapath snapshot. PC = 14; IR holds lw r1, r2(35) with rs = 2; three downstream control slots are marked n.]

Fetch 20, Decode 14, Exec 10

[Figure: datapath snapshot. PC = 20; IR holds addI r2, r2, 3 (rs = 2); lw r1 is in Exec with A = r2 and immediate 35; two downstream control slots are marked n.]

Fetch 24, Decode 20, Exec 14, Mem 10

[Figure: datapath snapshot. PC = 24; IR holds sub r3, r4, r5 (rs = 4, rt = 5); addI r2, r2, 3 is in Exec with A = r2 and immediate 3; lw r1 is in Mem with S = r2+35; one downstream control slot is marked n.]

Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10

[Figure: datapath snapshot. PC = 30; IR holds beq r6, r7, 100 (rs = 6, rt = 7); sub r3 is in Exec with A = r4, B = r5; addI r2 is in Mem with S = r2+3; lw r1 is in write-back with M = M[r2+35].]


[Figure: datapath snapshot for the next cycle (fetch 34). PC = 34; IR holds ori r8, r9, 17 (rs = 9, rt = xx); beq is in Exec with A = r6, B = r7 and target 100; sub r3 is in Mem with S = r4-r5; addI r2 is in write-back with r2+3; r1 = M[r2+35] is written to the Reg File.]


[Figure: datapath snapshot for the next cycle. PC = 100; IR holds add r10, r11, r12 (rs = 11, rt = 12); ori r8 is in Exec with A = r9, B = x and immediate 17; beq is in Mem; sub r3 is in write-back with r4-r5; r1 = M[r2+35] and r2 = r2+3 are in the Reg File.]

ooops, we should have only one delayed instruction


[Figure: datapath snapshot for the next cycle. PC = 104; IR holds and r13, r14, r15 (rs = 14, rt = 15); add r10 is in Exec with A = r11, B = r12, its control slot marked n; ori r8 is in Mem with S = r9 | 17; beq is in write-back; r1 = M[r2+35], r2 = r2+3, and r3 = r4-r5 are in the Reg File.]

Squash the extra instruction in the branch shadow!


[Figure: datapath snapshot for the next cycle. PC = 110; and r13 is in Exec with A = r14, B = r15; add r10 is in Mem with S = r11+r12, its control slot marked n; ori r8 is in write-back with r9 | 17; r1 = M[r2+35], r2 = r2+3, and r3 = r4-r5 are in the Reg File.]

Squash the extra instruction in the branch shadow!


[Figure: datapath snapshot for the next cycle. PC = 114; and r13 is in Mem with S = r14 & r15; add r10 is in write-back with r11+r12, marked n: NO WB, NO Ovflow; r1 = M[r2+35], r2 = r2+3, r3 = r4-r5, and r8 = r9 | 17 are in the Reg File.]

Squash the extra instruction in the branch shadow!

Summary: Pipelining

• What makes it easy

– all instructions are the same length

– just a few instruction formats

– memory operands appear only in loads and stores

• What makes it hard?

– structural hazards: suppose we had only one memory

– control hazards: need to worry about branch instructions

– data hazards: an instruction depends on a previous instruction

• We’ll build a simple pipeline and look at these issues

• We’ll talk about modern processors and what really makes it hard:

– exception handling

– trying to improve performance with out-of-order execution, etc.

Summary

• Pipelining is a fundamental concept

– multiple steps using distinct resources

• Utilize capabilities of the Datapath by pipelined instruction processing

–start next instruction while working on the current one

– limited by length of longest stage (plus fill/flush)

–detect and resolve hazards

What about Interrupts, Traps, Faults?

• External Interrupts:

– Allow pipeline to drain,

– Load PC with interrupt address

• Faults (within instruction, restartable)

–Force trap instruction into IF

–disable writes till trap hits WB

–must save multiple PCs or PC + state

Refer to MIPS solution

Exception Handling

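[Figure: pipelined datapath (IAU, npc, PC, I mem, Regs, A, B, im, op, rw, n, alu, S, D mem, m, Regs) with lw $2,20($5) in flight. Detection points along the pipe, in order: detect bad instruction address, detect bad instruction, detect overflow, detect bad data address; then allow exception to take effect.]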

Exception Problem

• Exceptions/Interrupts: 5 instructions executing in 5 stage pipeline

– How to stop the pipeline?

– Restart?

– Who caused the interrupt?

Stage | Problem interrupts occurring
IF | Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID | Undefined or illegal opcode
EX | Arithmetic exception
MEM | Page fault on data fetch; misaligned memory access; memory-protection violation; memory error

• Load with data page fault, Add with instruction page fault?

• Solution 1: interrupt vector/instruction , check last stage

• Solution 2: interrupt ASAP, restart everything incomplete

Resolution: Freeze above & Bubble Below

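[Figure: the same pipelined datapath (IAU, npc, PC, I mem, Regs, A, B, im, alu, S, D mem, m, Regs) with the per-stage control fields (op, rw, rs, rt, n). Stages above the faulting instruction are marked freeze; stages below it are marked bubble.]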

Memory

The Goal: illusion of large, fast, cheap memory

• Fact: Large memories are slow, fast memories are small

• How do we create a memory that is large, cheap and fast (most of the time)?

–Hierarchy

–Parallelism

An Expanded View of the Memory System

[Figure: a Processor (Control, Datapath) followed by a chain of Memory blocks of increasing size.

Speed: Fastest to Slowest

Size: Smallest to Biggest

Cost: Highest to Lowest]

Why hierarchy works

• The Principle of Locality:

– Programs access a relatively small portion of the address space at any instant of time.

[Figure: probability of reference plotted over the address space, from 0 to 2^n - 1.]

Memory Hierarchy: How Does it Work?

• Temporal Locality (Locality in Time):

=> Keep most recently accessed data items closer to the processor

• Spatial Locality (Locality in Space):

=> Move blocks consisting of contiguous words to the upper levels

[Figure: Upper Level Memory (holding Blk X) sits between the processor and Lower Level Memory (holding Blk Y); data moves to and from the processor through the upper level.]

Memory Hierarchy: Terminology

• Hit: data appears in some block in the upper level (example: Block X)

– Hit Rate: the fraction of memory accesses found in the upper level

– Hit Time: Time to access the upper level, which consists of RAM access time + Time to determine hit/miss

• Miss: data needs to be retrieved from a block in the lower level (Block Y)

– Miss Rate = 1 - (Hit Rate)

– Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor

• Hit Time << Miss Penalty

[Figure: Upper Level Memory (holding Blk X) sits between the processor and Lower Level Memory (holding Blk Y).]
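These terms are usually combined into an average access time. The relation below (hit time plus miss rate times miss penalty) is the standard textbook formula rather than something stated on this slide, and the numbers in the example are made up for illustration.

```python
def miss_rate(hit_rate: float) -> float:
    """Miss Rate = 1 - Hit Rate, as defined above."""
    return 1.0 - hit_rate


def average_access_time(hit_time: float, hit_rate: float,
                        miss_penalty: float) -> float:
    """Average time per access: every access pays the hit time,
    and the fraction that misses also pays the miss penalty."""
    return hit_time + miss_rate(hit_rate) * miss_penalty


if __name__ == "__main__":
    # Illustrative values only: 1 ns hit time, 95% hit rate, 100 ns miss penalty.
    print(f"{average_access_time(1.0, 0.95, 100.0):.2f} ns")
```

Because Hit Time << Miss Penalty, even a small miss rate dominates the average, which is why the later slides focus on reducing misses.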

Memory Hierarchy of a Modern Computer System

• By taking advantage of the principle of locality:

– Present the user with as much memory as is available in the cheapest technology.

– Provide access at the speed offered by the fastest technology.

[Figure: Processor (Control, Datapath, Registers, On-Chip Cache), Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Disk).]

Level | Speed (ns) | Size (bytes)
Registers | 1s | 100s
On-chip cache and second level cache (SRAM) | 10s | Ks
Main memory (DRAM) | 100s | Ms
Secondary storage (disk) | 10,000,000s (10s ms) | Gs
Tertiary storage (disk) | 10,000,000,000s (10s sec) | Ts

How is the hierarchy managed?

• Registers <-> Memory

– by compiler (programmer?)

• cache <-> memory

– by the hardware

• memory <-> disks

– by the hardware and operating system (virtual memory)

– by the programmer (files)

Example: 1 KB Direct Mapped Cache with 32 B Blocks

• For a 2 ** N byte cache:

– The uppermost (32 - N) bits are always the Cache Tag

– The lowest M bits are the Byte Select (Block Size = 2 ** M)

[Figure: the 32-bit address is split into Cache Tag (bits 31 to 10, example: 0x50), Cache Index (bits 9 to 5, ex: 0x01), and Byte Select (bits 4 to 0, ex: 0x00). The cache has entries 0 to 31; each holds a Valid Bit, a Cache Tag stored as part of the cache “state” (0x50 in entry 1), and 32 bytes of Cache Data: Byte 0 to Byte 31 in entry 0, Byte 32 to Byte 63 in entry 1, up to Byte 992 to Byte 1023 in entry 31.]
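The address split for this example can be sketched in Python. The function name and defaults are illustrative; the field widths follow from the 1 KB cache size (N = 10) and the 32 B block size (M = 5) given above, assuming both are powers of two.

```python
def split_address(addr: int, cache_bytes: int = 1024, block_bytes: int = 32):
    """Split a 32-bit address into (tag, index, byte_select)
    for a direct mapped cache whose sizes are powers of two."""
    offset_bits = block_bytes.bit_length() - 1                   # M = 5
    index_bits = (cache_bytes // block_bytes).bit_length() - 1   # N - M = 5
    byte_select = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)                     # upper 32 - N bits
    return tag, index, byte_select


if __name__ == "__main__":
    # Address built from the slide's example fields: tag 0x50, index 0x01, byte 0x00.
    addr = (0x50 << 10) | (0x01 << 5) | 0x00
    tag, index, byte_select = split_address(addr)
    print(hex(addr), hex(tag), hex(index), hex(byte_select))
```

A lookup then reads entry `index`, checks its valid bit, and compares the stored tag against `tag`; `byte_select` picks the byte within the 32 B block.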

Extreme Example: single big line

• Cache Size = 4 bytes, Block Size = 4 bytes

– Only ONE entry in the cache

• If an item is accessed, it is likely that it will be accessed again soon

– But it is unlikely that it will be accessed again immediately!!!

– The next access will likely be a miss again

• Continually loading data into the cache but discarding (forcing out) them before they are used again

• Worst nightmare of a cache designer: Ping Pong Effect

• Conflict Misses are misses caused by:

– Different memory locations mapped to the same cache index

• Solution 1: make the cache size bigger

• Solution 2: Multiple entries for the same Cache Index

[Figure: a single cache entry (index 0) with a Valid Bit, a Cache Tag, and Cache Data holding Byte 0 to Byte 3.]

Another Extreme Example: Fully Associative

• Fully Associative Cache

– Forget about the Cache Index

– Compare the Cache Tags of all cache entries in parallel

– Example: with Block Size = 32 B, we need N 27-bit comparators

• By definition: Conflict Miss = 0 for a fully associative cache

[Figure: the 32-bit address is split into Cache Tag (bits 31 to 5, 27 bits long) and Byte Select (bits 4 to 0, ex: 0x01). Every entry holds a Valid Bit, a Cache Tag, and 32 bytes of Cache Data (Byte 0 to Byte 31, Byte 32 to Byte 63, ...); each entry's tag has its own comparator (marked X).]

A Two-way Set Associative Cache

• N-way set associative: N entries for each Cache Index

– N direct mapped caches operate in parallel

• Example: Two-way set associative cache

– Cache Index selects a “set” from the cache

– The two tags in the set are compared in parallel

– Data is selected based on the tag result

[Figure: two banks, each with Valid, Cache Tag, and Cache Data (Cache Block 0, ...), addressed by the Cache Index. The address tag (Adr Tag) is compared against both stored tags; the two compare results drive Sel1 and Sel0 of a Mux that picks the Cache Block, and their OR produces Hit.]

Disadvantage of Set Associative Cache

• N-way Set Associative Cache versus Direct Mapped Cache:

– N comparators vs. 1

– Extra MUX delay for the data

– Data comes AFTER Hit/Miss decision and set selection

• In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:

– Possible to assume a hit and continue. Recover later if miss.

[Figure: the same two-way set associative organization: two banks of Valid, Cache Tag, and Cache Data selected by the Cache Index, two comparators against the address tag, a Mux (Sel1, Sel0) selecting the Cache Block, and an OR producing Hit.]

A Summary on Sources of Cache Misses

• Compulsory (cold start or process migration, first reference): first access to a block

– “Cold” fact of life: not a whole lot you can do about it

– Note: If you are going to run “billions” of instructions, Compulsory Misses are insignificant

• Conflict (collision):

– Multiple memory locations mapped to the same cache location

– Solution 1: increase cache size

– Solution 2: increase associativity

• Capacity:

– Cache cannot contain all blocks accessed by the program

– Solution: increase cache size

• Invalidation: other process (e.g., I/O) updates memory

Improving Cache Performance: 3 general options

1. Reduce the miss rate,

2. Reduce the miss penalty, or

3. Reduce the time to hit in the cache.

4 Questions for Memory Hierarchy

• Q1: Where can a block be placed in the upper level? (Block placement)

• Q2: How is a block found if it is in the upper level? (Block identification)

• Q3: Which block should be replaced on a miss? (Block replacement)

• Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?

• Block 12 placed in 8 block cache:

– Fully associative, direct mapped, 2-way set associative

– S.A. Mapping = Block Number Modulo Number Sets
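The block 12 example can be worked through with a short Python sketch of the modulo mapping. The function name is illustrative, and numbering the frames of a set contiguously is an assumption made only to have something to print.

```python
def candidate_frames(block_number: int, num_blocks: int, ways: int):
    """Cache frames where a memory block may be placed.

    ways = 1 is direct mapped; ways = num_blocks is fully associative.
    """
    num_sets = num_blocks // ways
    set_index = block_number % num_sets          # Block Number Modulo Number Sets
    first = set_index * ways                     # assumed frame numbering
    return list(range(first, first + ways))


if __name__ == "__main__":
    print(candidate_frames(12, 8, ways=1))   # direct mapped: [4]
    print(candidate_frames(12, 8, ways=2))   # 2-way: set 0 -> [0, 1]
    print(candidate_frames(12, 8, ways=8))   # fully associative: all 8 frames
```

Direct mapped gives 12 mod 8 = 4; with 2 ways there are 4 sets, so 12 mod 4 = 0 and either frame of set 0 may be used.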

Q2: How is a block found if it is in the upper level?

• Tag on each block

– No need to check index or block offset

• Increasing associativity shrinks index, expands tag

Q3: Which block should be replaced on a miss?

• Easy for Direct Mapped• Set Associative or Fully Associative:

– Random

– LRU (Least Recently Used)

Associativity:       2-way              4-way              8-way
Size           LRU      Random    LRU      Random    LRU      Random
16 KB          5.2%     5.7%      4.7%     5.3%      4.4%     5.0%
64 KB          1.9%     2.0%      1.5%     1.7%      1.4%     1.5%
256 KB         1.15%    1.17%     1.13%    1.13%     1.12%    1.12%
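The two policies can be compared on a small trace of block numbers. This is a rough sketch: the function `count_misses`, the trace and the seed are made up for illustration and are not the workload behind the table:

```python
import random
from collections import OrderedDict

def count_misses(trace, num_sets, ways, policy="lru", seed=0):
    """Count misses for a trace of block numbers in a set associative cache."""
    rng = random.Random(seed)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in trace:
        lines = sets[block % num_sets]
        if block in lines:
            if policy == "lru":
                lines.move_to_end(block)       # mark as most recently used
            continue
        misses += 1
        if len(lines) == ways:                 # set is full: pick a victim
            if policy == "lru":
                lines.popitem(last=False)      # evict least recently used
            else:
                del lines[rng.choice(list(lines))]
        lines[block] = True
    return misses

trace = [0, 4, 0, 8, 0, 4]                     # all map to set 0 when num_sets = 4
lru_misses = count_misses(trace, num_sets=4, ways=2, policy="lru")
random_misses = count_misses(trace, num_sets=4, ways=2, policy="random")
print("LRU misses:", lru_misses, "Random misses:", random_misses)
```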

Q4: What happens on a write?

• Write through—The information is written to both the block in the cache and to the block in the lower-level memory.

• Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

– is block clean or dirty?

• Pros and Cons of each?

– WT: read misses cannot result in writes

– WB: repeated writes to the same block do not each cause a write to memory

• WT is always combined with write buffers so that the processor doesn’t wait for lower-level memory
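The difference in memory traffic can be counted on a toy store stream. This sketch assumes a direct mapped cache, write allocate on a store miss, one store per block access, and a final flush of dirty blocks; `memory_writes` and the trace are illustrative:

```python
def memory_writes(stores, policy, num_lines=4):
    """Count writes that reach memory for a sequence of stored block numbers."""
    if policy == "write-through":
        return len(stores)              # every store also goes to memory
    lines = {}                          # index -> block held; always dirty here
    writes = 0
    for block in stores:
        index = block % num_lines
        if index in lines and lines[index] != block:
            writes += 1                 # dirty victim written back on replacement
        lines[index] = block            # write allocate (assumed)
    return writes + len(lines)          # final flush of the remaining dirty blocks

stores = [0, 0, 0, 1, 1, 4, 0]
wt = memory_writes(stores, "write-through")
wb = memory_writes(stores, "write-back")
print("write through:", wt, "write back:", wb)
```

The repeated stores to blocks 0 and 1 each cost a memory write under write through, but only one write back per block residency under write back.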

Write Buffer for Write Through

• A Write Buffer is needed between the Cache and Memory

–Processor: writes data into the cache and the write buffer

–Memory controller: write contents of the buffer to memory

• Write buffer is just a FIFO:

– Typical number of entries: 4

– Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle

[Figure: the Processor writes to the Cache and to the Write Buffer; the Write Buffer drains to DRAM.]
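The "works fine if" condition can be checked with a small FIFO model. This is a sketch under simple assumptions: DRAM retires one buffered write every `dram_write_cycle` cycles, and the processor stalls only when all entries are occupied. The function name and the store patterns are illustrative:

```python
from collections import deque

def stall_cycles(store_times, dram_write_cycle, entries=4):
    """Total cycles the processor waits because the write buffer is full."""
    pending = deque()                   # completion times of buffered writes, FIFO order
    stalls = 0
    for t in store_times:
        now = t + stalls                # earlier stalls delay later stores
        while pending and pending[0] <= now:
            pending.popleft()           # DRAM has finished these writes
        if len(pending) == entries:     # buffer full: wait for the oldest write
            stalls += pending[0] - now
            now = pending.popleft()
        # DRAM is serial: a new write starts after the last queued one finishes.
        start = max(now, pending[-1]) if pending else now
        pending.append(start + dram_write_cycle)
    return stalls

slow_stores = stall_cycles(list(range(0, 100, 10)), dram_write_cycle=5)
fast_stores = stall_cycles(list(range(8)), dram_write_cycle=10)
print("stores every 10 cycles, 5-cycle DRAM write:", slow_stores, "stall cycles")
print("stores every cycle, 10-cycle DRAM write:   ", fast_stores, "stall cycles")
```

When stores arrive faster than DRAM can retire them, the buffer saturates no matter how many entries it has; more entries only delay the first stall.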

Write-miss Policy: Write Allocate versus Not Allocate

• Assume: a 16-bit write to memory location 0x0 causes a miss

– Do we read in the block?

• Yes: Write Allocate

• No: Write Not Allocate

[Figure: direct mapped cache with 32 entries of 32 bytes each (Byte 0 ... Byte 1023). The address is split into Cache Tag (bits 31-10, Example: 0x00), Cache Index (bits 9-5, Ex: 0x00) and Byte Select (bits 4-0, Ex: 0x00). Each entry holds a Valid Bit, a Cache Tag and the Cache Data.]

Recall: Levels of the Memory Hierarchy

Level            Capacity      Access Time     Cost
CPU Registers    100s Bytes    <10s ns
Cache            K Bytes       10-100 ns       $.01-.001/bit
Main Memory      M Bytes       100ns-1us       $.01-.001
Disk             G Bytes       ms              10^-3 - 10^-4 cents
Tape             infinite      sec-min         10^-6

Between levels   Xfer Unit          Staging          Size
Registers/Cache  Instr. Operands    prog./compiler   1-8 bytes
Cache/Memory     Blocks             cache cntl       8-128 bytes
Memory/Disk      Pages              OS               512-4K bytes
Disk/Tape        Files              user/operator    Mbytes

Upper Level: faster. Lower Level: larger.

Basic Issues in Virtual Memory System Design

size of information blocks that are transferred from secondary to main storage (M)

if a block of information is brought into M and M is full, then some region of M must be released to make room for the new block --> replacement policy

which region of M is to hold the new block --> placement policy

missing item fetched from secondary memory only on the occurrence of a fault --> demand load policy

Paging Organization

virtual and physical address space partitioned into blocks of equal size: pages (virtual) and page frames (physical)

[Figure: reg, cache, mem and disk in a row; pages on disk are brought into frames in mem.]

Address Map

V = {0, 1, . . . , n - 1} virtual address space

M = {0, 1, . . . , m - 1} physical address space

n > m

MAP: V --> M U {0} address mapping function

MAP(a) = a' if data at virtual address a is present in physical address a' and a' in M

MAP(a) = 0 if data at virtual address a is not present in M

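[Figure: the Processor issues virtual address a from Name Space V to the Addr Trans Mechanism. If the item is present, physical address a' goes to Main Memory. If the result is 0, a missing item fault invokes the fault handler, and the OS performs the transfer from Secondary Memory to Main Memory.]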

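Paging Organization

[Figure: Physical Memory holds frames 0 to 7 of 1K each, at P.A. 0, 1024, ..., 7168. Virtual Memory holds pages 0 to 31 of 1K each, at V.A. 0, 1024, ..., 31744. Addr Trans MAP maps pages to frames.]

The page is the unit of mapping, and also the unit of transfer from virtual to physical memory.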

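Address Mapping

[Figure: the VA is split into a page no. and a 10-bit disp. The page no. is an index into the Page Table, which is located in physical memory at the address held in the Page Table Base Reg. Each entry holds V, Access Rights and PA. The PA is combined with disp (drawn as +; actually, concatenation is more likely) to form the physical memory address.]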
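The mapping in the figure can be sketched in a few lines, using the 1K pages (10-bit disp) shown there. The `PageFault` exception, the list-based page table and the example entry are assumptions for illustration; access rights are left out:

```python
PAGE_BITS = 10                       # 1K pages, so disp is 10 bits
PAGE_SIZE = 1 << PAGE_BITS

class PageFault(Exception):
    """Raised when the page table entry is not valid (missing item fault)."""

def translate(va, page_table):
    """page_table[page_no] is (valid, frame_no); returns the physical address."""
    page_no = va >> PAGE_BITS
    disp = va & (PAGE_SIZE - 1)
    valid, frame_no = page_table[page_no]
    if not valid:
        raise PageFault(page_no)
    return (frame_no << PAGE_BITS) | disp    # concatenation rather than addition

# 32 virtual pages, all initially absent; map page 1 to frame 7 (made-up entry).
page_table = [(False, 0)] * 32
page_table[1] = (True, 7)

pa = translate(1024 + 5, page_table)
print("VA 1029 -> PA", pa)                   # frame 7 starts at 7168
```

Because frames are aligned to the page size, concatenating the frame number with disp gives the same result as adding disp to the frame's base address, without needing an adder.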

Virtual Address and a Cache

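[Figure: the CPU sends the VA to Translation, which sends the PA to the Cache. On a hit the Cache returns data to the CPU; on a miss the access goes to Main Memory.]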

It takes an extra memory access to translate VA to PA

This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible

ASIDE: Why access cache with PA at all? VA caches have a problem!

synonym / alias problem: two different virtual addresses map to same physical address => two different cache entries holding data for the same physical address!

for update: must update all cache entries with same physical address or memory becomes inconsistent

determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits; or

software enforced alias boundary: VA and PA must have the same low-order bits, over a range at least as large as the cache size

TLBs

A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB

[TLB entry fields: Virtual Address | Physical Address | Dirty | Ref | Valid | Access]

TLB access time comparable to cache access time (much less than main memory access time)

Translation Look-Aside Buffers

Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped

TLBs are usually small, typically not more than 128 - 256 entries even on high end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.
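A fully associative TLB behaves like a small cache keyed by virtual page number. The sketch below assumes LRU replacement and a dictionary standing in for the page table; the slides specify neither, and the class and its fields are illustrative:

```python
from collections import OrderedDict

class TLB:
    """Fully associative TLB with LRU replacement (the policy is an assumption)."""

    def __init__(self, entries, page_table):
        self.entries = entries
        self.page_table = page_table          # dict: virtual page -> physical frame
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn):
        if vpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpn)       # most recently used
        else:
            self.misses += 1
            if len(self.cache) == self.entries:
                self.cache.popitem(last=False)        # evict least recently used
            self.cache[vpn] = self.page_table[vpn]    # slow path: read the page table
        return self.cache[vpn]

page_table = {vpn: vpn + 100 for vpn in range(8)}     # made-up mapping
tlb = TLB(entries=2, page_table=page_table)
frames = [tlb.lookup(vpn) for vpn in [0, 0, 1, 0, 2, 1]]
print(frames, "hits:", tlb.hits, "misses:", tlb.misses)
```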

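[Figure: Translation with a TLB. The CPU sends the VA to the TLB Lookup. On a TLB hit the PA goes directly to the Cache; on a TLB miss the full Translation produces the PA. A cache hit returns data; a cache miss goes to Main Memory. Time labels in the figure: 20 t (Translation), t (Cache), 1/2 t (TLB Lookup).]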
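Reading the figure's time labels as 1/2 t for a TLB lookup and 20 t for a full translation (an interpretation of the labels, not something the slides state as a formula), the average translation time can be estimated as follows. The miss rates are made-up inputs:

```python
def translation_time(tlb_miss_rate, t=1.0, tlb_lookup=0.5, full_translation=20.0):
    """Average translation time: every access pays the TLB lookup,
    and TLB misses additionally pay the full translation."""
    return (tlb_lookup + tlb_miss_rate * full_translation) * t

no_misses = translation_time(0.0)
one_percent = translation_time(0.01)
always_miss = translation_time(1.0)

for rate in (0.0, 0.01, 0.05, 1.0):
    print(f"TLB miss rate {rate:5.2f}: {translation_time(rate):5.2f} t")
```

Under this reading, translation stays cheaper than one cache access (t) as long as the TLB miss rate is below 2.5%.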

Summary #1 / 4:

• The Principle of Locality:

– Program likely to access a relatively small portion of the address space at any instant of time.

• Temporal Locality: Locality in Time

• Spatial Locality: Locality in Space

• Three Major Categories of Cache Misses:

– Compulsory Misses: sad facts of life. Example: cold start misses.

– Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect!

– Capacity Misses: increase cache size

• Cache Design Space

– total size, block size, associativity

– replacement policy

– write-hit policy (write-through, write-back)

– write-miss policy

Summary #2 / 4: The Cache Design Space

• Several interacting dimensions

– cache size

–block size

–associativity

– replacement policy

–write-through vs write-back

–write allocation

[Figure: the cache design space drawn with axes Associativity, Cache Size and Block Size.]

Summary #3 / 4: TLB, Virtual Memory

• Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions: 1) Where can block be placed? 2) How is block found? 3) What block is replaced on miss? 4) How are writes handled?

• Page tables map virtual address to physical address

• TLBs are important for fast translation

• TLB misses are significant in processor performance: (funny times, as most systems can’t access all of 2nd level cache without TLB misses!)

Summary #4 / 4: Memory Hierarchy

• Today VM allows many processes to share single memory without having to swap all processes to disk; VM protection is more important than memory hierarchy

• Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?
