Single Cycle datapath
Jan 17, 2016
How to Design a Processor: step-by-step
• 1. Analyze instruction set => datapath requirements
– the meaning of each instruction is given by the register transfers
– datapath must include storage element for ISA registers
• possibly more
– datapath must support each register transfer
• 2. Select set of datapath components and establish clocking methodology
• 3. Assemble datapath meeting the requirements
• 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer
• 5. Assemble the control logic
The MIPS Instruction Formats
• All MIPS instructions are 32 bits long. The three instruction formats:
– R-type: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)
– I-type: op (31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
– J-type: op (31-26, 6 bits) | target address (25-0, 26 bits)
• The different fields are:
– op: operation of the instruction
– rs, rt, rd: the source and destination register specifiers
– shamt: shift amount
– funct: selects the variant of the operation in the “op” field
– address / immediate: address offset or immediate value
– target address: target address of the jump instruction
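The field boundaries of the three formats translate directly into shifts and masks. A minimal sketch in Python; the function name `decode` and the dictionary layout are illustrative choices, not something the slides define:

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into every field of every format.

    Which fields are meaningful depends on the format (R, I, or J).
    """
    return {
        "op":     (word >> 26) & 0x3F,       # bits 31-26
        "rs":     (word >> 21) & 0x1F,       # bits 25-21
        "rt":     (word >> 16) & 0x1F,       # bits 20-16
        "rd":     (word >> 11) & 0x1F,       # bits 15-11 (R-type)
        "shamt":  (word >> 6) & 0x1F,        # bits 10-6  (R-type)
        "funct":  word & 0x3F,               # bits 5-0   (R-type)
        "imm16":  word & 0xFFFF,             # bits 15-0  (I-type)
        "target": word & 0x3FFFFFF,          # bits 25-0  (J-type)
    }


# An R-type word assembled by hand: op=0, rs=1, rt=2, rd=3, shamt=0, funct=0b100000.
word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | (0 << 6) | 0b100000
fields = decode(word)
```

Decoding all fields unconditionally mirrors the hardware, where every field is wired out of the instruction word and control decides which ones are used.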
Step 1a: The MIPS-lite Subset for today
• ADD and SUB (R-type: op | rs | rt | rd | shamt | funct)
– addU rd, rs, rt
– subU rd, rs, rt
• OR Immediate (I-type: op | rs | rt | immediate)
– ori rt, rs, imm16
• LOAD and STORE Word (I-type: op | rs | rt | immediate)
– lw rt, rs, imm16
– sw rt, rs, imm16
• BRANCH (I-type: op | rs | rt | immediate)
– beq rs, rt, imm16
Logical Register Transfers
• RTL gives the meaning of the instructions
• All start by fetching the instruction
op | rs | rt | rd | shamt | funct = MEM[ PC ]
op | rs | rt | Imm16 = MEM[ PC ]
inst Register Transfers
ADDU R[rd] <– R[rs] + R[rt]; PC <– PC + 4
SUBU R[rd] <– R[rs] – R[rt]; PC <– PC + 4
ORi R[rt] <– R[rs] or zero_ext(Imm16); PC <– PC + 4
LOAD R[rt] <– MEM[ R[rs] + sign_ext(Imm16) ]; PC <– PC + 4
STORE MEM[ R[rs] + sign_ext(Imm16) ] <– R[rt]; PC <– PC + 4
BEQ if ( R[rs] == R[rt] ) then PC <– PC + 4 + ( sign_ext(Imm16) || 00 )
else PC <– PC + 4
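The register transfers can be read as a tiny interpreter. The sketch below is one possible rendering in Python, assuming a word-granular memory modeled as a dict and instructions given as already-decoded tuples; `step`, `zero_ext`, `sign_ext`, and the tuple layout are hypothetical names for illustration. The branch target uses PC + 4 plus the shifted offset, following the branch slides later in this deck.

```python
MASK32 = 0xFFFFFFFF


def zero_ext(imm16):
    """Zero-extend a 16-bit immediate to 32 bits."""
    return imm16 & 0xFFFF


def sign_ext(imm16):
    """Sign-extend a 16-bit immediate; returns a signed Python int."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def step(inst, R, MEM, PC):
    """Apply one instruction's register transfers and return the next PC."""
    name, a, b, c = inst
    if name == "addu":                      # addu rd, rs, rt
        R[a] = (R[b] + R[c]) & MASK32
    elif name == "subu":                    # subu rd, rs, rt
        R[a] = (R[b] - R[c]) & MASK32
    elif name == "ori":                     # ori rt, rs, imm16
        R[a] = R[b] | zero_ext(c)
    elif name == "lw":                      # lw rt, rs, imm16
        R[a] = MEM[(R[b] + sign_ext(c)) & MASK32]
    elif name == "sw":                      # sw rt, rs, imm16
        MEM[(R[b] + sign_ext(c)) & MASK32] = R[a]
    elif name == "beq":                     # beq rs, rt, imm16
        if R[a] == R[b]:
            return (PC + 4 + (sign_ext(c) << 2)) & MASK32
    else:
        raise ValueError(f"unsupported instruction: {name}")
    return (PC + 4) & MASK32


R, MEM, pc = [0] * 32, {}, 0
pc = step(("ori", 1, 0, 0x10), R, MEM, pc)    # R[1] <- R[0] or 0x10
pc = step(("sw", 1, 0, 0x20), R, MEM, pc)     # MEM[0x20] <- R[1]
pc = step(("lw", 2, 0, 0x20), R, MEM, pc)     # R[2] <- MEM[0x20]
pc = step(("beq", 1, 2, 0xFFFF), R, MEM, pc)  # taken, offset -1 word
```

Each branch of `step` corresponds to one row of the RTL table; the datapath built in the following slides has to make every one of these transfers possible in a single clock cycle.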
Step 1: Requirements of the Instruction Set
• Memory
– instruction & data
• Registers (32 x 32)
– read RS
– read RT
– write RT or RD
• PC
• Extender
• Add and Sub register or extended immediate
• Add 4 or extended immediate to PC
Step 2: Components of the Datapath
• Combinational Elements
• Storage Elements
– Clocking methodology
Combinational Logic Elements (Basic Building Blocks)
• Adder: 32-bit inputs A and B plus CarryIn; outputs 32-bit Sum and Carry
• MUX: 32-bit inputs A and B plus Select; output 32-bit Y
• ALU: 32-bit inputs A and B plus OP; output 32-bit Result
Storage Element: Register (Basic Building Block)
• Register
– Similar to the D Flip Flop except
• N-bit input and output
• Write Enable input
– Write Enable:
• negated (0): Data Out will not change
• asserted (1): Data Out will become Data In
[Figure: register with N-bit Data In and Data Out, Write Enable, and Clk]
Storage Element: Register File
• Register File consists of 32 registers:
– Two 32-bit output busses: busA and busB
– One 32-bit input bus: busW
• Register is selected by:
– RA (number) selects the register to put on busA (data)
– RB (number) selects the register to put on busB (data)
– RW (number) selects the register to be written via busW (data) when Write Enable is 1
• Clock input (CLK)
– The CLK input is a factor ONLY during write operation
– During read operation, behaves as a combinational logic block:
• RA or RB valid => busA or busB valid after “access time.”
[Figure: register file of 32 32-bit registers with 5-bit RW, RA, RB; 32-bit busW in; 32-bit busA and busB out; Write Enable; Clk]
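The read and write behavior described for the register file can be mimicked with a small behavioral model. This is only a sketch in Python: the class and method names are invented for illustration, access time is not modeled, and no register is treated as hardwired to zero because the slide does not mention it.

```python
class RegisterFile:
    """Behavioral model: 32 registers of 32 bits, two read ports, one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, ra, rb):
        """Combinational read: busA and busB simply follow RA and RB."""
        return self.regs[ra], self.regs[rb]

    def clock_edge(self, write_enable, rw, bus_w):
        """The clock matters only for writes: latch busW into RW if enabled."""
        if write_enable:
            self.regs[rw] = bus_w & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(write_enable=0, rw=5, bus_w=99)   # Write Enable = 0: no change
before = rf.read(5, 0)
rf.clock_edge(write_enable=1, rw=5, bus_w=99)   # Write Enable = 1: R[5] <- 99
after = rf.read(5, 0)
```

Separating `read` from `clock_edge` reflects the slide's point that reads behave like combinational logic while writes happen only on the clock.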
Storage Element: Idealized Memory
• Memory (idealized)
– One input bus: Data In
– One output bus: Data Out
• Memory word is selected by:
– Address selects the word to put on Data Out
– Write Enable = 1: address selects the memory word to be written via the Data In bus
• Clock input (CLK)
– The CLK input is a factor ONLY during write operation
– During read operation, behaves as a combinational logic block:
• Address valid => Data Out valid after “access time.”
[Figure: memory with Address, 32-bit Data In, 32-bit Data Out, Write Enable, and Clk]
Clocking Methodology
• All storage elements are clocked by the same clock edge
• Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew
• (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
[Figure: clock waveform with Setup and Hold windows around each clock edge; signals are Don’t Care outside those windows]
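The two timing constraints above are simple arithmetic, so a short check makes them concrete. The sketch below is in Python with made-up delay values in picoseconds; none of the numbers come from the slides, and the function names are illustrative.

```python
def min_cycle_time(clk_to_q, longest_path, setup, skew):
    """Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew."""
    return clk_to_q + longest_path + setup + skew


def hold_ok(clk_to_q, shortest_path, skew, hold):
    """Hold constraint: (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time."""
    return (clk_to_q + shortest_path - skew) > hold


# Hypothetical delays in picoseconds, chosen only to exercise the formulas.
cycle = min_cycle_time(clk_to_q=50, longest_path=900, setup=30, skew=20)
safe = hold_ok(clk_to_q=50, shortest_path=40, skew=20, hold=30)
unsafe = hold_ok(clk_to_q=50, shortest_path=0, skew=20, hold=30)
```

Note that the cycle-time bound depends on the longest path while the hold check depends on the shortest one, so speeding up logic can never fix a hold violation.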
Step 3
• Register Transfer Requirements –> Datapath Assembly
• Instruction Fetch
• Read Operands and Execute Operation
3a: Overview of the Instruction Fetch Unit
• The common RTL operations
– Fetch the Instruction: mem[PC]
– Update the program counter:
• Sequential Code: PC <- PC + 4
• Branch and Jump: PC <- “something else”
[Figure: PC (clocked by Clk) supplies the Address to the Instruction Memory, which outputs the 32-bit Instruction Word; Next Address Logic updates the PC]
3b: Add & Subtract
• R[rd] <- R[rs] op R[rt] Example: addU rd, rs, rt
– Ra, Rb, and Rw come from instruction’s rs, rt, and rd fields
– ALUctr and RegWr: control logic after decoding the instruction
[Figure: register file (Rw = Rd, Ra = Rs, Rb = Rt) drives busA and busB into the ALU; the 32-bit Result returns on busW; control signals ALUctr and RegWr; R-type format op | rs | rt | rd | shamt | funct]
Register-Register Timing
[Figure: timing diagram for a register-register instruction. After the clock edge, each signal moves from Old Value to New Value in turn: PC after Clk-to-Q; Rs, Rt, Rd, Op, Func after the Instruction Memory Access Time; ALUctr and RegWr after the Delay through Control Logic; busA and busB after the Register File Access Time; busW after the ALU Delay. The register write occurs at the next clock edge.]
3c: Logical Operations with Immediate
• R[rt] <- R[rs] op ZeroExt[imm16]
[Figure: datapath with a RegDst mux choosing Rd or Rt as Rw, and an ALUSrc mux choosing busB or ZeroExt(imm16) as the second ALU input; I-type format op | rs | rt | immediate]
• ZeroExt: bits 31-16 of the result are sixteen 0s; bits 15-0 hold the 16-bit immediate
3d: Load Operations
• R[rt] <- Mem[R[rs] + SignExt[imm16]] Example: lw rt, rs, imm16
[Figure: datapath with an Extender (controlled by ExtOp) feeding the ALUSrc mux, a Data Memory (Adr from the ALU, Data In, WrEn driven by MemWr), and a W_Src mux selecting the ALU result or the memory output for busW; I-type format op | rs | rt | immediate]
3e: Store Operations
• Mem[ R[rs] + SignExt[imm16] ] <- R[rt] Example: sw rt, rs, imm16
[Figure: the same datapath as for load; busB (R[rt]) drives Data In of the Data Memory, which is written when MemWr is asserted; I-type format op | rs | rt | immediate]
3f: The Branch Instruction
• beq rs, rt, imm16
– mem[PC] Fetch the instruction from memory
– Equal <- R[rs] == R[rt] Calculate the branch condition
– if (Equal) Calculate the next instruction’s address
• PC <- PC + 4 + ( SignExt(imm16) x 4 )
– else
• PC <- PC + 4
Datapath for Branch Operations
• beq rs, rt, imm16: the datapath generates the condition (Equal)
[Figure: branch datapath. One adder computes PC + 4; a second adder adds the PC-extended imm16 with 00 appended; a mux controlled by nPC_sel and Cond picks the next PC, which addresses the instruction memory. The register file outputs busA (Rs) and busB (Rt) feed an Equal? comparator that produces Cond.]
Putting it All Together: A Single Cycle Datapath
[Figure: complete single-cycle datapath. Instruction<31:0> from Inst Memory supplies Rs <21:25>, Rt <16:20>, Rd <11:15>, and Imm16 <0:15>. Control points: nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg. The datapath reports Equal.]
Step 4: Given Datapath: RTL -> Control
[Figure: Inst Memory delivers Instruction<31:0>. The Op and Fun fields go to Control; Rs, Rt, Rd, and Imm16 go to the DATA PATH. Control drives nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg; the datapath returns Equal.]
Meaning of the Control Signals
• ExtOp: “zero”, “sign”
• ALUsrc: 0 => regB; 1 => immed
• ALUctr: “add”, “sub”, “or”
• MemWr: write memory
• MemtoReg: 1 => Mem
• RegDst: 0 => “rt”; 1 => “rd”
• RegWr: write dest register
[Figure: single-cycle datapath with these control points marked]
Example: Load Instruction
[Figure: single-cycle datapath annotated for the load instruction: the Extender performs sign ext, the ALU performs add, the destination register is rt, and the next PC is +4]
An Abstract View of the Implementation
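• Logical vs. Physical Structure
[Figure: Control and Datapath blocks. The PC supplies the Instruction Address to the Ideal Instruction Memory; the Instruction supplies 5-bit Rs, Rt, and Rd to the 32 32-bit registers; buses A and B feed the ALU; the ALU supplies the Data Address to the Ideal Data Memory (Data In, Data Out); Next Address logic updates the PC. Control sends Control Signals to the Datapath and receives Conditions.]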
Summary
• 5 steps to design a processor
– 1. Analyze instruction set => datapath requirements
– 2. Select set of datapath components & establish clock methodology
– 3. Assemble datapath meeting the requirements
– 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer
– 5. Assemble the control logic
• MIPS makes it easier
– Instructions same size
– Source registers always in same place
– Immediates same size, location
– Operations always on registers/immediates
• Single cycle datapath => CPI=1, CCT => long
• Next time: implementing control
Recap: A Single Cycle Datapath
• We have everything except control signals (underline)
– Today’s lecture will show you how to generate the control signals
[Figure: single-cycle datapath with an Instruction Fetch Unit supplying Instruction<31:0> (Rs <21:25>, Rt <16:20>, Rd <11:15>, Imm16 <0:15>); control signals nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg; ALU output Zero]
RTL: The Add Instruction
• add rd, rs, rt
–mem[PC] Fetch the instruction from memory
–R[rd] <- R[rs] + R[rt] The actual operation
–PC <- PC + 4 Calculate the next instruction’s address
(R-type format: op | rs | rt | rd | shamt | funct)
Instruction Fetch Unit at the Beginning of Add
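• Fetch the instruction from Instruction memory: Instruction <- mem[PC]
– This is the same for all instructions
[Figure: instruction fetch unit: the PC addresses Inst Memory, which outputs Instruction<31:0>; one adder computes PC + 4, a second adds the PC-extended imm16 with 00 appended, and a mux controlled by nPC_sel selects the next PC]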
The Single Cycle Datapath during Add
• R[rd] <- R[rs] + R[rt]
• Control settings: nPC_sel = +4, RegDst = 1, RegWr = 1, ExtOp = x, ALUSrc = 0, ALUctr = Add, MemWr = 0, MemtoReg = 0
[Figure: single-cycle datapath with these control values applied; R-type format op | rs | rt | rd | shamt | funct]
Instruction Fetch Unit at the End of Add
• PC <- PC + 4
– This is the same for all instructions except: Branch and Jump
[Figure: instruction fetch unit with nPC_sel selecting the PC + 4 adder output]
The Single Cycle Datapath during Or Immediate
• R[rt] <- R[rs] or ZeroExt[Imm16]
• Control settings are left blank on this slide: nPC_sel = , RegDst = , RegWr = , ExtOp = , ALUSrc = , ALUctr = , MemWr = , MemtoReg =
[Figure: single-cycle datapath with unlabeled control values; I-type format op | rs | rt | immediate]
The Single Cycle Datapath during Load
• R[rt] <- Data Memory {R[rs] + SignExt[imm16]}
• Control settings: nPC_sel = +4, RegDst = 0, RegWr = 1, ExtOp = 1, ALUSrc = 1, ALUctr = Add, MemWr = 0, MemtoReg = 1
[Figure: single-cycle datapath with these control values applied; I-type format op | rs | rt | immediate]
The Single Cycle Datapath during Store
• Data Memory {R[rs] + SignExt[imm16]} <- R[rt]
• Control settings are left blank on this slide: nPC_sel = , RegDst = , RegWr = , ExtOp = , ALUSrc = , ALUctr = , MemWr = , MemtoReg =
[Figure: single-cycle datapath with unlabeled control values; I-type format op | rs | rt | immediate]
The Single Cycle Datapath during Store
• Data Memory {R[rs] + SignExt[imm16]} <- R[rt]
• Control settings: nPC_sel = +4, RegDst = x, RegWr = 0, ExtOp = 1, ALUSrc = 1, ALUctr = Add, MemWr = 1, MemtoReg = x
[Figure: single-cycle datapath with these control values applied; I-type format op | rs | rt | immediate]
The Single Cycle Datapath during Branch
• if (R[rs] - R[rt] == 0) then Zero <- 1 ; else Zero <- 0
• Control settings: nPC_sel = “Br”, RegDst = x, RegWr = 0, ExtOp = x, ALUSrc = 0, ALUctr = Subtract, MemWr = 0, MemtoReg = x
[Figure: single-cycle datapath with these control values applied; I-type format op | rs | rt | immediate]
Instruction Fetch Unit at the End of Branch
• if (Zero == 1) then PC = PC + 4 + SignExt[imm16]*4 ; else PC = PC + 4
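[Figure: instruction fetch unit: the PC addresses Inst Memory; one adder computes PC + 4, a second adds the extended imm16 with 00 appended, and the mux controlled by nPC_sel selects the next PC; I-type format op | rs | rt | immediate]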
A Summary of Control Signals
inst Register Transfer
ADD R[rd] <– R[rs] + R[rt]; PC <– PC + 4
ALUsrc = RegB, ALUctr = “add”, RegDst = rd, RegWr, nPC_sel = “+4”
SUB R[rd] <– R[rs] – R[rt]; PC <– PC + 4
ALUsrc = RegB, ALUctr = “sub”, RegDst = rd, RegWr, nPC_sel = “+4”
ORi R[rt] <– R[rs] or zero_ext(Imm16); PC <– PC + 4
ALUsrc = Im, Extop = “Z”, ALUctr = “or”, RegDst = rt, RegWr, nPC_sel = “+4”
LOAD R[rt] <– MEM[ R[rs] + sign_ext(Imm16) ]; PC <– PC + 4
ALUsrc = Im, Extop = “Sn”, ALUctr = “add”, MemtoReg, RegDst = rt, RegWr, nPC_sel = “+4”
STORE MEM[ R[rs] + sign_ext(Imm16) ] <– R[rt]; PC <– PC + 4
ALUsrc = Im, Extop = “Sn”, ALUctr = “add”, MemWr, nPC_sel = “+4”
BEQ if ( R[rs] == R[rt] ) then PC <– PC + 4 + ( sign_ext(Imm16) || 00 ) else PC <– PC + 4
nPC_sel = “Br”, ALUctr = “sub”
A Summary of the Control Signals
See Appendix A
             add      sub       ori      lw       sw       beq       jump
func         10 0000  10 0010   We Don’t Care :-)
op           00 0000  00 0000   00 1101  10 0011  10 1011  00 0100   00 0010
RegDst       1        1         0        0        x        x         x
ALUSrc       0        0         1        1        1        0         x
MemtoReg     0        0         0        1        x        x         x
RegWrite     1        1         1        1        0        0         0
MemWrite     0        0         0        0        1        0         0
nPCsel       0        0         0        0        0        1         0
Jump         0        0         0        0        0        0         1
ExtOp        x        x         0        1        1        x         x
ALUctr<2:0>  Add      Subtract  Or       Add      Add      Subtract  xxx

R-type: op | rs | rt | rd | shamt | funct (add, sub)
I-type: op | rs | rt | immediate (ori, lw, sw, beq)
J-type: op | target address (jump)
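The control table is small enough to write down as a lookup keyed by opcode and, for R-type instructions, by the func field. The following is a sketch in Python that copies the table values, with "x" standing for a don't-care; the names `CONTROL`, `OPCODES`, `FUNCTS`, and `main_control` are illustrative.

```python
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemWrite",
           "nPCsel", "Jump", "ExtOp", "ALUctr")

# One row per instruction, in SIGNALS order; "x" marks a don't-care.
CONTROL = {
    "add":  (1,   0,   0,   1, 0, 0, 0, "x", "Add"),
    "sub":  (1,   0,   0,   1, 0, 0, 0, "x", "Subtract"),
    "ori":  (0,   1,   0,   1, 0, 0, 0, 0,   "Or"),
    "lw":   (0,   1,   1,   1, 0, 0, 0, 1,   "Add"),
    "sw":   ("x", 1,   "x", 0, 1, 0, 0, 1,   "Add"),
    "beq":  ("x", 0,   "x", 0, 0, 1, 0, "x", "Subtract"),
    "jump": ("x", "x", "x", 0, 0, 0, 1, "x", "xxx"),
}

OPCODES = {0b001101: "ori", 0b100011: "lw", 0b101011: "sw",
           0b000100: "beq", 0b000010: "jump"}
FUNCTS = {0b100000: "add", 0b100010: "sub"}   # used only when op == 00 0000


def main_control(op, funct=0):
    """Return the control-signal settings for an instruction as a dict."""
    name = FUNCTS[funct] if op == 0 else OPCODES[op]
    return dict(zip(SIGNALS, CONTROL[name]))


lw_signals = main_control(0b100011)
sub_signals = main_control(0b000000, 0b100010)
```

A real controller would resolve each "x" to whichever value gives the simplest logic; the sketch keeps them visible so the table and the code can be compared line by line.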
The Concept of Local Decoding
             R-type    ori      lw       sw       beq       jump
op           00 0000   00 1101  10 0011  10 1011  00 0100   00 0010
RegDst       1         0        0        x        x         x
ALUSrc       0         1        1        1        0         x
MemtoReg     0         0        1        x        x         x
RegWrite     1         1        1        0        0         0
MemWrite     0         0        0        1        0         0
Branch       0         0        0        0        1         0
Jump         0         0        0        0        0         1
ExtOp        x         0        1        1        x         x
ALUop<N:0>   “R-type”  Or       Add      Add      Subtract  xxx

[Figure: Main Control decodes the 6-bit op and produces the N-bit ALUop; ALU Control (Local) combines ALUop with the 6-bit func to produce the 3-bit ALUctr for the ALU]
The Encoding of ALUop
• In this exercise, ALUop has to be 2 bits wide to represent:
– (1) “R-type” instructions
– “I-type” instructions that require the ALU to perform:
• (2) Or, (3) Add, and (4) Subtract
• To implement the full MIPS ISA, ALUop has to be 3 bits to represent:
– (1) “R-type” instructions
– “I-type” instructions that require the ALU to perform:
• (2) Or, (3) Add, (4) Subtract, and (5) And (Example: andi)
[Figure: Main Control (6-bit op) produces the N-bit ALUop; ALU Control (Local) combines it with the 6-bit func to produce the 3-bit ALUctr]

                  R-type    ori   lw    sw    beq       jump
ALUop (Symbolic)  “R-type”  Or    Add   Add   Subtract  xxx
ALUop<2:0>        1 00      0 10  0 00  0 00  0 01      xxx
The Decoding of the “func” Field
                  R-type    ori   lw    sw    beq       jump
ALUop (Symbolic)  “R-type”  Or    Add   Add   Subtract  xxx
ALUop<2:0>        1 00      0 10  0 00  0 00  0 01      xxx

[Figure: Main Control (6-bit op) produces the N-bit ALUop; ALU Control (Local) combines it with the 6-bit func to produce the 3-bit ALUctr for the ALU]

R-type format: op | rs | rt | rd | shamt | funct

funct<5:0>  Instruction Operation
10 0000     add
10 0010     subtract
10 0100     and
10 0101     or
10 1010     set-on-less-than

ALUctr<2:0>  ALU Operation
000          And
001          Or
010          Add
110          Subtract
111          Set-on-less-than

Recall ALU Homework (also P. 286 text)
The Truth Table for ALUctr
                  R-type    ori   lw    sw    beq
ALUop (Symbolic)  “R-type”  Or    Add   Add   Subtract
ALUop<2:0>        1 00      0 10  0 00  0 00  0 01

ALUop<2:0>   func<3:0>   ALU Operation   ALUctr<2:0>
0 0 0        x x x x     Add             0 1 0
0 x 1        x x x x     Subtract        1 1 0
0 1 x        x x x x     Or              0 0 1
1 x x        0 0 0 0     Add             0 1 0
1 x x        0 0 1 0     Subtract        1 1 0
1 x x        0 1 0 0     And             0 0 0
1 x x        0 1 0 1     Or              0 0 1
1 x x        1 0 1 0     Set on <        1 1 1

funct<3:0>  Instruction Op.
0000        add
0010        subtract
0100        and
0101        or
1010        set-on-less-than
The Logic Equation for ALUctr<2>
ALUop<2:0>   func<3:0>   ALUctr<2>
0 x 1        x x x x     1
1 x x        0 0 1 0     1
1 x x        1 0 1 0     1
• ALUctr<2> = !ALUop<2> & ALUop<0> + ALUop<2> & !func<2> & func<1> & !func<0>
This makes func<3> a don’t care
The Logic Equation for ALUctr<1>
ALUop<2:0>   func<3:0>   ALUctr<1>
0 0 0        x x x x     1
0 x 1        x x x x     1
1 x x        0 0 0 0     1
1 x x        0 0 1 0     1
1 x x        1 0 1 0     1
• ALUctr<1> = !ALUop<2> & !ALUop<1> + ALUop<2> & !func<2> & !func<0>
The Logic Equation for ALUctr<0>
ALUop<2:0>   func<3:0>   ALUctr<0>
0 1 x        x x x x     1
1 x x        0 1 0 1     1
1 x x        1 0 1 0     1
• ALUctr<0> = !ALUop<2> & ALUop<1>
+ ALUop<2> & !func<3> & func<2> & !func<1> & func<0>
+ ALUop<2> & func<3> & !func<2> & func<1> & !func<0>
The ALU Control Block
[Figure: ALU Control (Local) takes the 3-bit ALUop and the 6-bit func and produces the 3-bit ALUctr]
• ALUctr<2> = !ALUop<2> & ALUop<0> + ALUop<2> & !func<2> & func<1> & !func<0>
• ALUctr<1> = !ALUop<2> & !ALUop<1> + ALUop<2> & !func<2> & !func<0>
• ALUctr<0> = !ALUop<2> & ALUop<1>
+ ALUop<2> & !func<3> & func<2> & !func<1> & func<0>
+ ALUop<2> & func<3> & !func<2> & func<1> & !func<0>
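The three sum-of-products equations can be checked mechanically against the ALUctr truth table. The sketch below is in Python and treats the truth table as the reference: the I-type terms test ALUop<1> where the table's "0 1 x" (Or) row requires it. The function name `alu_control` is an illustrative choice.

```python
def alu_control(aluop, func):
    """Compute ALUctr<2:0> from ALUop<2:0> and func<3:0> in sum-of-products form."""
    a2, a1, a0 = (aluop >> 2) & 1, (aluop >> 1) & 1, aluop & 1
    f3, f2, f1, f0 = (func >> 3) & 1, (func >> 2) & 1, (func >> 1) & 1, func & 1

    def inv(bit):
        return bit ^ 1

    c2 = (inv(a2) & a0) | (a2 & inv(f2) & f1 & inv(f0))
    c1 = (inv(a2) & inv(a1)) | (a2 & inv(f2) & inv(f0))
    c0 = ((inv(a2) & a1)
          | (a2 & inv(f3) & f2 & inv(f1) & f0)
          | (a2 & f3 & inv(f2) & f1 & inv(f0)))
    return (c2 << 2) | (c1 << 1) | c0


# (ALUop, func<3:0>) -> expected ALUctr, one entry per truth-table row.
TRUTH_TABLE = {
    (0b000, 0b0000): 0b010,   # lw / sw: Add
    (0b001, 0b0000): 0b110,   # beq: Subtract
    (0b010, 0b0000): 0b001,   # ori: Or
    (0b100, 0b0000): 0b010,   # R-type add
    (0b100, 0b0010): 0b110,   # R-type subtract
    (0b100, 0b0100): 0b000,   # R-type and
    (0b100, 0b0101): 0b001,   # R-type or
    (0b100, 0b1010): 0b111,   # R-type set-on-less-than
}
mismatches = [key for key, want in TRUTH_TABLE.items()
              if alu_control(*key) != want]
```

An empty `mismatches` list means the equations reproduce every row of the table, which is the same check one would do by hand before committing the logic to gates.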
The “Truth Table” for the Main Control
                  R-type    ori      lw       sw       beq       jump
op                00 0000   00 1101  10 0011  10 1011  00 0100   00 0010
RegDst            1         0        0        x        x         x
ALUSrc            0         1        1        1        0         x
MemtoReg          0         0        1        x        x         x
RegWrite          1         1        1        0        0         0
MemWrite          0         0        0        1        0         0
Branch            0         0        0        0        1         0
Jump              0         0        0        0        0         1
ExtOp             x         0        1        1        x         x
ALUop (Symbolic)  “R-type”  Or       Add      Add      Subtract  xxx
ALUop <2>         1         0        0        0        0         x
ALUop <1>         0         1        0        0        0         x
ALUop <0>         0         0        0        0        1         x

[Figure: Main Control decodes the 6-bit op into RegDst, ALUSrc, ... and the 3-bit ALUop; ALU Control (Local) combines ALUop with the 6-bit func to produce the 3-bit ALUctr]
Putting it All Together: A Single Cycle Processor
[Figure: single-cycle datapath plus control. The Instruction Fetch Unit supplies Instruction<31:0> (Rs <21:25>, Rt <16:20>, Rd <11:15>, Imm16 <0:15>). Main Control decodes op = Instr<31:26> and drives nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, MemWr, MemtoReg, and the 3-bit ALUop; ALU Control combines ALUop with func = Instr<5:0> to produce the 3-bit ALUctr. Imm16 = Instr<15:0>. The ALU reports Zero.]
Worst Case Timing (Load)
[Figure: timing diagram for load. After the clock edge, each signal moves from Old Value to New Value in turn: PC after Clk-to-Q; Rs, Rt, Rd, Op, Func after the Instruction Memory Access Time; ALUctr, RegWr, ExtOp, ALUSrc, MemtoReg after the Delay through Control Logic; busA after the Register File Access Time; busB after the Delay through Extender & Mux; Address after the ALU Delay; busW after the Data Memory Access Time. The register write occurs at the next clock edge.]
Drawback of this Single Cycle Processor
• Long cycle time:
– Cycle time must be long enough for the load instruction:
PC’s Clock-to-Q +
Instruction Memory Access Time +
Register File Access Time +
ALU Delay (address calculation) +
Data Memory Access Time +
Register File Setup Time +
Clock Skew
• Cycle time for load is much longer than needed for all other instructions
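To see how much of the cycle other instructions leave unused, the load path can be summed next to a shorter path. The sketch below is in Python; every delay value is hypothetical (the slides give no numbers), and the R-type path is assumed to be the load path without the data memory access.

```python
# Hypothetical component delays in picoseconds; not taken from the slides.
DELAY = {
    "clk_to_q": 50,
    "inst_mem": 400,
    "reg_read": 200,
    "alu": 300,
    "data_mem": 400,
    "reg_setup": 50,
    "skew": 20,
}

LOAD_PATH = ["clk_to_q", "inst_mem", "reg_read", "alu",
             "data_mem", "reg_setup", "skew"]
# Assumption: an R-type instruction skips only the data memory access.
R_TYPE_PATH = ["clk_to_q", "inst_mem", "reg_read", "alu",
               "reg_setup", "skew"]


def path_delay(path):
    """Sum the component delays along one instruction's path."""
    return sum(DELAY[stage] for stage in path)


cycle_time = path_delay(LOAD_PATH)                 # the clock must fit the load
idle_in_r_type = cycle_time - path_delay(R_TYPE_PATH)
```

With these made-up numbers the clock period is set by the load, and every R-type instruction finishes one data-memory access time early yet still occupies the whole cycle.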
Summary
• Single cycle datapath => CPI=1, CCT => long
• 5 steps to design a processor
– 1. Analyze instruction set => datapath requirements
– 2. Select set of datapath components & establish clock methodology
– 3. Assemble datapath meeting the requirements
– 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer
– 5. Assemble the control logic
• Control is the hard part
• MIPS makes control easier
– Instructions same size
– Source registers always in same place
– Immediates same size, location
– Operations always on registers/immediates
[Figure: Processor (Control and Datapath), Memory, Input, Output]
Multicycle Datapath
Partitioning the CPI=1 Datapath
• Add registers between smallest steps
[Figure: the CPI=1 datapath drawn as a chain of steps: PC, Next PC, Instruction Fetch, Operand Fetch, Exec, Mem Access / Data Mem, Reg. File, Result Store; control points nPC_sel, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, RegDst, RegWr]
Example Multicycle Datapath
• Critical Path ?
[Figure: the same chain of steps (PC, Next PC, Instruction Fetch, Operand Fetch, Ext, ALU, Mem Access / Data Mem, Reg. File, Result Store) separated by registers IR, A, B, R, and M; control points nPC_sel, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, MemToReg, RegDst, RegWr; condition Equal]
Recall: Step-by-step Processor Design
Step 1: ISA => Logical Register Transfers
Step 2: Components of the Datapath
Step 3: RTL + Components => Datapath
Step 4: Datapath + Logical RTs => Physical RTs
Step 5: Physical RTs => Control
Step 4: R-type (add, sub, . . .)
• Logical Register Transfer
inst Logical Register Transfers
ADDU R[rd] <– R[rs] + R[rt]; PC <– PC + 4
• Physical Register Transfers
inst Physical Register Transfers
IR <– MEM[pc]
ADDU A <– R[rs]; B <– R[rt]
S <– A + B
R[rd] <– S; PC <– PC + 4
[Figure: multicycle datapath: PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access / Data Mem, M, Reg. File, Equal]
Step 4: Logical immed
• Logical Register Transfer
inst Logical Register Transfers
ORI R[rt] <– R[rs] OR zx(Im16); PC <– PC + 4
• Physical Register Transfers
inst Physical Register Transfers
IR <– MEM[pc]
ORI A <– R[rs]; B <– R[rt]
S <– A or ZeroExt(Im16)
R[rt] <– S; PC <– PC + 4
[Figure: multicycle datapath: PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access / Data Mem, M, Reg. File, Equal]
Step 4 : Load
• Logical Register Transfer
inst Logical Register Transfers
LW R[rt] <– MEM(R[rs] + sx(Im16)); PC <– PC + 4
• Physical Register Transfers
inst Physical Register Transfers
IR <– MEM[pc]
LW A <– R[rs]; B <– R[rt]
S <– A + SignEx(Im16)
M <– MEM[S]
R[rt] <– M; PC <– PC + 4
[Figure: multicycle datapath: PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access / Data Mem, M, Reg. File, Equal]
Step 4 : Store
• Logical Register Transfer
inst Logical Register Transfers
SW MEM(R[rs] + sx(Im16)) <– R[rt]; PC <– PC + 4
• Physical Register Transfers
inst Physical Register Transfers
IR <– MEM[pc]
SW A <– R[rs]; B <– R[rt]
S <– A + SignEx(Im16)
MEM[S] <– B; PC <– PC + 4
[Figure: multicycle datapath: PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access / Data Mem, M, Reg. File, Equal]
Step 4 : Branch
• Logical Register Transfer
inst Logical Register Transfers
BEQ if R[rs] == R[rt]
then PC <= PC + sx(Im16) || 00
else PC <= PC + 4
• Physical Register Transfers (branch not taken)
inst Physical Register Transfers
IR <– MEM[pc]
BEQ & ~Eq PC <– PC + 4
• Physical Register Transfers (branch taken)
inst Physical Register Transfers
IR <– MEM[pc]
BEQ & Eq PC <– PC + sx(Im16) || 00
[Figure: multicycle datapath: PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access / Data Mem, M, Reg. File, Equal]
Alternative datapath (book): Multiple Cycle Datapath
• Minimizes Hardware: 1 memory, 1 adder
[Figure: multiple-cycle datapath from the book with one Ideal Memory (RAdr, WrAdr, Din, Dout), Instruction Reg, Reg File (Ra, Rb, Rw, busA, busB, busW), Extend, << 2, Target register, ALU with ALU Control, and ALU Out; control points IorD, MemWr, IRWr, RegDst, RegWr, MemtoReg, ExtOp, ALUSelA, ALUSelB, ALUOp, PCWr, PCWrCond, PCSrc, BrWr; ALU output Zero]
Our Control Model
• State specifies control points for Register Transfer
• Transfer occurs upon exiting state (same falling edge)
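[Figure: the Control State register feeds the Output Logic, which produces the outputs (control points), and the Next State Logic, which also takes the inputs (conditions). State X specifies the Control Points for one Register Transfer; the next state Depends on Input.]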
Step 4 => Control Specification for multicycle proc
[State diagram, one row per state]
“instruction fetch”: IR <= MEM[PC]
“decode / operand fetch”: A <= R[rs]; B <= R[rt]
R-type
– Execute: S <= A fun B
– Write-back: R[rd] <= S; PC <= PC + 4
ORi
– Execute: S <= A or ZX
– Write-back: R[rt] <= S; PC <= PC + 4
LW
– Execute: S <= A + SX
– Memory: M <= MEM[S]
– Write-back: R[rt] <= M; PC <= PC + 4
SW
– Execute: S <= A + SX
– Memory: MEM[S] <= B; PC <= PC + 4
BEQ & ~Equal: PC <= PC + 4
BEQ & Equal: PC <= PC + SX || 00
Step 5: datapath + state diagram => control
• Translate RTs into control points
• Assign states
• Then go build the controller
Mapping RTs to Control Points
[State diagram annotated with the control points that implement each register transfer]
“instruction fetch”: IR <= MEM[PC] => imem_rd, IRen
“decode”: A <= R[rs]; B <= R[rt] => Aen, Ben
R-type Execute: S <= A fun B => ALUfun, Sen
R-type Write-back: R[rd] <= S; PC <= PC + 4 => RegDst, RegWr, PCen
The ORi, LW, SW, and BEQ states are not annotated on this slide.
Assigning States
[State diagram with a 4-bit number assigned to each state]
0000 “instruction fetch”: IR <= MEM[PC]
0001 “decode”: A <= R[rs]; B <= R[rt]
0100 R-type Execute: S <= A fun B
0101 R-type Write-back: R[rd] <= S; PC <= PC + 4
0110 ORi Execute: S <= A or ZX
0111 ORi Write-back: R[rt] <= S; PC <= PC + 4
1000 LW Execute: S <= A + SX
1001 LW Memory: M <= MEM[S]
1010 LW Write-back: R[rt] <= M; PC <= PC + 4
1011 SW Execute: S <= A + SX
1100 SW Memory: MEM[S] <= B; PC <= PC + 4
0011 BEQ & ~Equal: PC <= PC + 4
0010 BEQ & Equal: PC <= PC + SX || 00
Detailed Control Specification
Output columns, left to right: IR | PC: en, sel | Ops: A, B | Exec: Ex, Sr, ALU, S | Mem: R, W, M | Write-Back: M-R, Wr, Dst
The Outputs column lists the values of each row in that order, with blank columns omitted.

      State  Op field  Eq  Next  Outputs
      0000   ??????    ?   0001  1
      0001   BEQ       0   0011  1 1
      0001   BEQ       1   0010  1 1
      0001   R-type    x   0100  1 1
      0001   orI       x   0110  1 1
      0001   LW        x   1000  1 1
      0001   SW        x   1011  1 1
      0010   xxxxxx    x   0000  1 1
      0011   xxxxxx    x   0000  1 0
R:    0100   xxxxxx    x   0101  0 1 fun 1
      0101   xxxxxx    x   0000  1 0 0 1 1
ORi:  0110   xxxxxx    x   0111  0 0 or 1
      0111   xxxxxx    x   0000  1 0 0 1 0
LW:   1000   xxxxxx    x   1001  1 0 add 1
      1001   xxxxxx    x   1010  1 0 0
      1010   xxxxxx    x   0000  1 0 1 1 0
SW:   1011   xxxxxx    x   1100  1 0 add 1
      1100   xxxxxx    x   0000  1 0 0 1

The outputs of the state 0001 rows are all the same in a Moore machine.
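The State, Op field, Eq, and Next columns of this specification define an explicit next-state function. A minimal sketch in Python, using only those four columns; the names `NEXT`, `DISPATCH`, `next_state`, and `walk` are illustrative, and opcodes are given symbolically as in the table:

```python
# States with a single successor regardless of opcode and Eq.
NEXT = {
    0b0000: 0b0001,
    0b0010: 0b0000, 0b0011: 0b0000,
    0b0100: 0b0101, 0b0101: 0b0000,
    0b0110: 0b0111, 0b0111: 0b0000,
    0b1000: 0b1001, 0b1001: 0b1010, 0b1010: 0b0000,
    0b1011: 0b1100, 0b1100: 0b0000,
}
# State 0001 dispatches on the opcode (BEQ also looks at Eq).
DISPATCH = {"R-type": 0b0100, "orI": 0b0110, "LW": 0b1000, "SW": 0b1011}


def next_state(state, op, eq):
    """Explicit next-state function read off the specification table."""
    if state == 0b0001:
        if op == "BEQ":
            return 0b0010 if eq else 0b0011
        return DISPATCH[op]
    return NEXT[state]


def walk(op, eq=0):
    """List the states visited by one instruction, starting at fetch."""
    state, visited = 0b0000, [0b0000]
    while True:
        state = next_state(state, op, eq)
        if state == 0b0000:
            return visited
        visited.append(state)


lw_states = walk("LW")
beq_taken_states = walk("BEQ", eq=1)
```

The length of each list is the number of cycles that instruction takes in this multicycle design, which is the quantity a single-cycle design fixes at one.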
Controller Design
• The state diagrams that define the controller for an instruction set processor are highly structured
• Use this structure to construct a simple “microsequencer”
• Control reduces to programming this very simple device
– microprogramming
[Figure: a microinstruction holds sequencer control and datapath control fields; the sequencer updates the micro-PC]
Example: Jump-Counter
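[Figure: a Counter with zero, inc, and load inputs. From state i, zero goes to 0000, inc goes to i+1, and load takes the next state from the Map ROM, which is indexed by the op-code.]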
Using a Jump Counter
[Figure: the numbered state diagram with each transition labeled by its counter action: inc out of state 0000; load at state 0001 to dispatch on the op-code, or inc for a taken branch; inc within each instruction's sequence; zero from the last state of each instruction back to 0000]
Our Microsequencer
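[Figure: the op-code indexes the Map ROM; the Micro-PC has Z, I, and L (zero, inc, load) inputs and takes taken as a condition input; the microinstruction it selects supplies the datapath control]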
Microprogram Control Specification
Output columns, left to right: IR | PC: en, sel | Ops: A, B | Exec: Ex, Sr, ALU, S | Mem: R, W, M | Write-Back: M-R, Wr, Dst
The Outputs column lists the values of each row in that order, with blank columns omitted.

      µPC   Taken  Next  Outputs
      0000  ?      inc   1
      0001  0      load
      0001  1      inc
BEQ:  0010  x      zero  1 1
      0011  x      zero  1 0
R:    0100  x      inc   0 1 fun 1
      0101  x      zero  1 0 0 1 1
ORi:  0110  x      inc   0 0 or 1
      0111  x      zero  1 0 0 1 0
LW:   1000  x      inc   1 0 add 1
      1001  x      inc   1 0 0
      1010  x      zero  1 0 1 1 0
SW:   1011  x      inc   1 0 add 1
      1100  x      zero  1 0 0 1
Mapping ROM
R-type 000000 0100
BEQ 000100 0011
ori 001101 0110
LW 100011 1000
SW 101011 1011
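With the mapping ROM in place, the microsequencer needs only three actions: zero, inc, and load. The sketch below, in Python, simulates the micro-PC using the Next column of the microprogram and the ROM contents above; `SEQ`, `MAP_ROM`, `next_upc`, and `trace` are illustrative names, and `taken` is assumed to be asserted only for a BEQ whose operands are equal.

```python
# Sequencing action per micro-PC value (state 0001 is handled separately).
SEQ = {
    0b0000: "inc",
    0b0010: "zero", 0b0011: "zero",
    0b0100: "inc", 0b0101: "zero",
    0b0110: "inc", 0b0111: "zero",
    0b1000: "inc", 0b1001: "inc", 0b1010: "zero",
    0b1011: "inc", 0b1100: "zero",
}
# Mapping ROM: 6-bit opcode -> first micro-PC value of that instruction.
MAP_ROM = {
    0b000000: 0b0100,   # R-type
    0b000100: 0b0011,   # BEQ
    0b001101: 0b0110,   # ori
    0b100011: 0b1000,   # LW
    0b101011: 0b1011,   # SW
}


def next_upc(upc, opcode, taken):
    """Apply zero / inc / load to the micro-PC."""
    if upc == 0b0001:
        # Taken branch increments to 0010; otherwise load from the mapping ROM.
        return upc + 1 if taken else MAP_ROM[opcode]
    return 0 if SEQ[upc] == "zero" else upc + 1


def trace(opcode, taken=False):
    """List the micro-PC values for one instruction, starting at 0000."""
    upc, visited = 0, [0]
    while True:
        upc = next_upc(upc, opcode, taken)
        if upc == 0:
            return visited
        visited.append(upc)


lw_trace = trace(0b100011)
beq_taken_trace = trace(0b000100, taken=True)
```

Compared with an explicit next-state table, the counter needs no per-state next-state bits beyond the two-bit action, which is why consecutive state numbers were assigned to each instruction's sequence.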
Overview of Control
• Control may be designed using one of several initial representations. The choice of sequence control, and how logic is represented, can then be determined independently; the control can then be implemented with one of several methods using a structured logic technique.

Initial Representation    Finite State Diagram          Microprogram
Sequencing Control        Explicit Next State Function  Microprogram counter + Dispatch ROMs
Logic Representation      Logic Equations               Truth Tables
Implementation Technique  PLA                           ROM
                          “hardwired control”           “microprogrammed control”
Summary
• Disadvantages of the Single Cycle Processor
–Long cycle time
–Cycle time is too long for all instructions except the Load
• Multiple Cycle Processor:
–Divide the instructions into smaller steps
–Execute each step (instead of the entire instruction) in one cycle
• Partition datapath into equal size chunks to minimize cycle time
– ~10 levels of logic between latches
• Follow same 5-step method for designing “real” processor
Summary (cont’d)
• Control is specified by finite state diagram
• Specialized state diagrams easily captured by microsequencer
– simple increment & “branch” fields
– datapath control fields
• Control design reduces to Microprogramming
• Control is more complicated with:
– complex instruction sets
– restricted datapaths (see the book)
• Simple Instruction set and powerful datapath => simple control
– could try to reduce hardware (see the book)
– rather go for speed => many instructions at once!
Our Controller FSM Spec
[State diagram, one row per state]
0000 “instruction fetch”: IR <= MEM[PC]; PC <= PC + 4
0001 “decode”: A <= R[rs]; B <= R[rt]
R-type
– 0100 Execute: S <= A fun B
– 0101 Write-back: R[rd] <= S
ORi
– 0110 Execute: S <= A op ZX
– 0111 Write-back: R[rt] <= S
LW
– 1000 Execute: S <= A + SX
– 1001 Memory: M <= MEM[S]
– 1010 Write-back: R[rt] <= M
SW
– 1011 Execute: S <= A + SX
– 1100 Memory: MEM[S] <= B
BEQ
– 0010 Execute: S <= A - B
– 0011 (Equal): PC <= PC + SX || 00
– ~Equal: return to instruction fetch
Microprogramming
• Control is the hard part of processor design
– Datapath is fairly regular and well-organized
– Memory is highly regular
– Control is irregular and global
Microprogramming:
-- A particular strategy for implementing the control unit of a processor by "programming" at the level of register transfer operations
Microarchitecture:
-- Logical structure and functional capabilities of the hardware as seen by the microprogrammer
Historical Note:
IBM 360 Series first to distinguish between architecture & organization
Same instruction set across wide range of implementations, each with different cost/performance
Sequencer-based control unit
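[Figure: the Control Logic drives Outputs to the Multicycle Datapath; its Inputs are the Opcode and the State Reg. An Adder computes state + 1, and the Address Select Logic chooses the next state.]
Types of “branching”
• Set state to 0
• Dispatch (state 1)
• Use incremented state number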
Designing a Microinstruction Set
1) Start with list of control signals
2) Group signals together that make sense (vs. random): called “fields”
3) Place fields in some logical order (e.g., ALU operation & ALU operands first and microinstruction sequencing last)
4) Create a symbolic legend for the microinstruction format, showing name of field values and how they set the control signals
–Use computers to design computers
5) To minimize the width, encode operations that will never be used at the same time
1&2) Start with list of control signals, grouped into fields

Single Bit Control
Signal name   Effect when deasserted        Effect when asserted
ALUSelA       1st ALU operand = PC          1st ALU operand = Reg[rs]
RegWrite      None                          Reg. is written
MemtoReg      Reg. write data input = ALU   Reg. write data input = memory
RegDst        Reg. dest. no. = rt           Reg. dest. no. = rd
TargetWrite   None                          Target reg. = ALU
MemRead       None                          Memory at address is read
MemWrite      None                          Memory at address is written
IorD          Memory address = PC           Memory address = ALU
IRWrite       None                          IR = Memory
PCWrite       None                          PC = PCSource
PCWriteCond   None                          IF ALUzero then PC = PCSource

Multiple Bit Control
Signal name  Value  Effect
ALUOp        00     ALU adds
             01     ALU subtracts
             10     ALU does function code
             11     ALU does logical OR
ALUSelB      000    2nd ALU input = Reg[rt]
             001    2nd ALU input = 4
             010    2nd ALU input = sign extended IR[15-0]
             011    2nd ALU input = sign extended, shift left 2 IR[15-0]
             100    2nd ALU input = zero extended IR[15-0]
PCSource     00     PC = ALU
             01     PC = Target
             10     PC = PC+4[29-26] : IR[25–0] << 2
Start with list of control signals, cont’d
• For next state function (next microinstruction address), use Sequencer-based control unit from last lecture
– Called “microPC” or “µPC” vs. state register

Signal      Value  Effect
Sequencing  00     Next address = 0
            01     Next address = dispatch ROM
            10     Next address = address + 1

[Figure: the Opcode indexes a ROM; a Mux with inputs 0, the ROM output, and microPC + 1 (from the Adder) selects the next microPC under control of the µAddress Select Logic]
3) Microinstruction Format: unencoded vs. encoded fields
Field Name        Width (wide)  Width (narrow)  Control Signals Set
ALU Control       4             2               ALUOp
SRC1              2             1               ALUSelA
SRC2              5             3               ALUSelB
ALU Destination   6             4               RegWrite, MemtoReg, RegDst, TargetWr.
Memory            4             3               MemRead, MemWrite, IorD
Memory Register   1             1               IRWrite
PCWrite Control   5             4               PCWrite, PCWriteCond, PCSource
Sequencing        3             2               AddrCtl
Total width       30            20              bits
4) Legend of Fields and Symbolic Names
Field Name        Values for Field   Function of Field with Specific Value
ALU               Add                ALU adds
                  Subt.              ALU subtracts
                  Func code          ALU does function code
                  Or                 ALU does logical OR
SRC1              PC                 1st ALU input = PC
                  rs                 1st ALU input = Reg[rs]
SRC2              4                  2nd ALU input = 4
                  Extend             2nd ALU input = sign ext. IR[15-0]
                  Extend0            2nd ALU input = zero ext. IR[15-0]
                  Extshft            2nd ALU input = sign ex., sl IR[15-0]
                  rt                 2nd ALU input = Reg[rt]
ALU destination   Target             Target = ALUout
                  rd                 Reg[rd] = ALUout
Memory            Read PC            Read memory using PC
                  Read ALU           Read memory using ALU output
                  Write ALU          Write memory using ALU output
Memory register   IR                 IR = Mem
                  Write rt           Reg[rt] = Mem
                  Read rt            Mem = Reg[rt]
PC write          ALU                PC = ALU output
                  Target-cond.       IF ALU Zero then PC = Target
                  jump addr.         PC = PCSource
Sequencing        Seq                Go to sequential µinstruction
                  Fetch              Go to the first microinstruction
                  Dispatch           Dispatch using ROM.
Microprogramming Pros and Cons
• Ease of design
• Flexibility
– Easy to adapt to changes in organization, timing, technology
– Can make changes late in design cycle, or even in the field
• Can implement very powerful instruction sets (just more control memory)
• Generality
– Can implement multiple instruction sets on same machine.
– Can tailor instruction set to application.
• Compatibility
– Many organizations, same instruction set
• Costly to implement
• Slow
Exceptions
• Exception = unprogrammed control transfer
– system takes action to handle the exception
• must record the address of the offending instruction
– returns control to user
– must save & restore user state
• Allows construction of a “user virtual machine”
[Figure: a user program with normal control flow (sequential, jumps, branches, calls, returns) transfers to the System Exception Handler on an exception, then returns from exception]
What happens to Instruction with Exception?
• MIPS architecture defines the instruction as having no effect if the instruction causes an exception.
• When we get to virtual memory we will see that certain classes of exceptions must prevent the instruction from changing the machine state.
• This aspect of handling exceptions becomes complex and potentially limits performance => why it is hard
Two Types of Exceptions
• Interrupts
– caused by external events
– asynchronous to program execution
– may be handled between instructions
– simply suspend and resume user program
• Traps
– caused by internal events
• exceptional conditions (overflow)
• errors (parity)
• faults (non-resident page)
– synchronous to program execution
– condition must be remedied by the handler
– instruction may be retried or simulated and program continued, or program may be aborted
MIPS convention:
• exception means any unexpected change in control flow, without distinguishing internal or external; use the term interrupt only when the event is externally caused.
Type of event                    From where?   MIPS terminology
I/O device request               External      Interrupt
Invoke OS from user program      Internal      Exception
Arithmetic overflow              Internal      Exception
Using an undefined instruction   Internal      Exception
Hardware malfunctions            Either        Exception or Interrupt
Additions to MIPS ISA to support Exceptions?
• EPC–a 32-bit register used to hold the address of the affected instruction (register 14 of coprocessor 0).
• Cause–a register used to record the cause of the exception. In the MIPS architecture this register is 32 bits, though some bits are currently unused. Assume that bits 5 to 2 of this register encode the two possible exception sources mentioned above: undefined instruction=0 and arithmetic overflow=1 (register 13 of coprocessor 0).
• BadVAddr - register containing the memory address at which the offending memory reference occurred (register 8 of coprocessor 0)
• Status - interrupt mask and enable bits (register 12 of coprocessor 0)
• Control signals to write EPC , Cause, BadVAddr, and Status
• Be able to write exception address into PC: widen the PC mux with an additional input, 10000000 00000000 00000000 10000000two (8000 0080hex)
• May have to undo PC = PC + 4, since want EPC to point to offending instruction (not its successor); PC = PC - 4
How Control Detects Exceptions in our FSD
• Undefined Instruction
– detected when no next state is defined from state 1 for the op value.
– We handle this exception by defining the next state value for all op values other than lw, sw, 0 (R-type), jmp, beq, and ori as new state 12.
– Shown symbolically using “other” to indicate that the op field does not match any of the opcodes that label arcs out of state 1.
• Arithmetic overflow
– Chapter 4 included logic in the ALU to detect overflow, and a signal called Overflow is provided as an output from the ALU. This signal is used in the modified finite state machine to specify an additional possible next state.
• Note: Challenge in designing control of a real machine is to handle different interactions between instructions and other exception-causing events such that control logic remains small and fast.
– Complex interactions make the control unit the most challenging aspect of hardware design
Modification to the Control Specification
[Figure: finite state diagram extended with two exception states; the register transfers on each path are listed below]
• Fetch: IR <= MEM[PC]; PC <= PC + 4
• Decode: A <= R[rs]; B <= R[rt]
• R-type: S <= A fun B; R[rd] <= S
• ORi: S <= A op ZX; R[rt] <= S
• LW: S <= A + SX; M <= MEM[S]; R[rt] <= M
• SW: S <= A + SX; MEM[S] <= B
• BEQ: S <= A - B; if Equal: PC <= PC + SX || 00; if ~Equal: PC unchanged
• other (undefined instruction): EPC <= PC - 4; PC <= exp_addr; cause <= 10 (RI)
• overflow (additional condition from Datapath): EPC <= PC - 4; PC <= exp_addr; cause <= 12 (Ovf)
Summary
• Specialized state diagrams are easily captured by a microsequencer
– simple increment & “branch” fields
– datapath control fields
• Control design reduces to Microprogramming
• Exceptions are the hard part of control
• Need to find convenient place to detect exceptions and to branch to state or microinstruction that saves PC and invokes the operating system
• It gets even harder with pipelined CPUs that support page faults on memory accesses: the instruction cannot complete, AND you must be able to restart the program at exactly the instruction with the exception
Pipelining
Pipelining is Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Figure: task order (A, B, C, D) versus time, 6 PM to midnight; each load runs 30 + 40 + 20 minutes, one load after another]
Pipelined Laundry: Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
[Figure: task order (A, B, C, D) versus time, starting at 6 PM; overlapped stage times 30, 40, 40, 40, 40, 20 minutes]
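The two laundry totals can be checked with a short calculation. This is a minimal sketch, assuming one washer, one dryer, and one folder, and that a finished load can wait until the next machine is free; the function names are illustrative.

```python
STAGE_MINUTES = [30, 40, 20]  # washer, dryer, "folder"


def sequential_minutes(stage_minutes, loads):
    """Each load finishes all stages before the next load starts."""
    return sum(stage_minutes) * loads


def pipelined_minutes(stage_minutes, loads):
    """Each load starts a stage as soon as it is ready and the machine is free."""
    machine_free_at = [0] * len(stage_minutes)
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load has finished its previous stage
        for stage, duration in enumerate(stage_minutes):
            start = max(ready, machine_free_at[stage])
            ready = start + duration
            machine_free_at[stage] = ready
        finish = ready
    return finish


print(sequential_minutes(STAGE_MINUTES, 4) / 60)  # 6.0 hours
print(pipelined_minutes(STAGE_MINUTES, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: once the pipeline is full, the 40-minute dryer sets the rate.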
Pipelining Lessons
• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously using different resources
• Potential speedup = Number pipe stages
• Unbalanced lengths of pipe stages reduces speedup
• Time to “fill” pipeline and time to “drain” it reduces speedup
• Stall for Dependences
Pipelined Execution
• Utilization?
• Now we just have to make it work
[Figure: six instructions, each passing through IFetch, Dcd, Exec, Mem, WB, each starting one cycle after the previous one; axes: Program Flow versus Time]
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock timing for three implementations. Single Cycle Implementation: Load and Store each take one long cycle (Cycle 1, Cycle 2), with wasted time in the Store cycle. Multiple Cycle Implementation: Load (Ifetch, Reg, Exec, Mem, Wr), then Store (Ifetch, Reg, Exec, Mem), then R-type (Ifetch ...), over Cycles 1 to 10. Pipeline Implementation: Load, Store, and R-type overlap, each with Ifetch, Reg, Exec, Mem, Wr]
Why Pipeline?
• Suppose we execute 100 instructions
• Single Cycle Machine
– 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle Machine
– 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
• Ideal pipelined machine
– 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
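The three totals above follow from one formula, cycle time multiplied by total cycles. A minimal sketch of that arithmetic (the function name and the `extra_cycles` parameter are illustrative):

```python
def total_time_ns(cycle_ns, cpi, instructions, extra_cycles=0):
    """Execution time = cycle time x (CPI x instruction count + extra cycles)."""
    return cycle_ns * (cpi * instructions + extra_cycles)


single_cycle = total_time_ns(45, 1, 100)
multicycle = total_time_ns(10, 4.6, 100)
pipelined = total_time_ns(10, 1, 100, extra_cycles=4)  # 4 cycles to drain

for name, t in [("single cycle", single_cycle),
                ("multicycle", multicycle),
                ("ideal pipeline", pipelined)]:
    print(f"{name}: {t:.0f} ns")
```

With these numbers the ideal pipeline is roughly 4.3 times faster than the single-cycle machine, short of the 5x stage count because of the drain cycles and the 10 ns versus 45/5 = 9 ns cycle.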
Why Pipeline? Because the resources are there!
[Figure: instruction order (Inst 0 to Inst 4) versus time in clock cycles; each instruction uses Im, Reg, ALU, Dm, Reg in successive cycles, offset by one cycle from the previous instruction]
Can pipelining get us into trouble?
• Yes: Pipeline Hazards
– structural hazards: attempt to use the same resource two different ways at the same time
• E.g., combined washer/dryer would be a structural hazard, or folder busy doing something else (watching TV)
– data hazards: attempt to use item before it is ready
• E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer
• instruction depends on result of prior instruction still in the pipeline
– control hazards: attempt to make a decision before condition is evaluated
• E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
• branch instructions
• Can always resolve hazards by waiting
– pipeline control must detect the hazard
– take action (or delay action) to resolve hazards
Summary 1/3
• Specialized state diagrams are easily captured by a microsequencer
– simple increment & “branch” fields
– datapath control fields
• Control design reduces to Microprogramming
• Exceptions are the hard part of control
• Need to find convenient place to detect exceptions and to branch to state or microinstruction that saves PC and invokes the operating system
• It gets even harder with pipelined CPUs that support page faults on memory accesses: the instruction cannot complete, AND you must be able to restart the program at exactly the instruction with the exception
Summary 2/3
• Microprogramming is a fundamental concept
– implement an instruction set by building a very simple processor and interpreting the instructions
– essential for very complex instructions and when few register transfers are possible
• Pipelining is a fundamental concept
– multiple steps using distinct resources
• Utilize capabilities of the Datapath by pipelined instruction processing
–start next instruction while working on the current one
– limited by length of longest stage (plus fill/flush)
–detect and resolve hazards
The Five Stages of Load
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file
[Figure: Load: Ifetch, Reg/Dec, Exec, Mem, Wr in Cycles 1 to 5]
Pipelining
• Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?
Basic Idea
• What do we need to add to actually split the datapath into stages?
Graphically Representing Pipelines
• Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
Conventional Pipelined Execution Representation
[Figure: six instructions, each passing through IFetch, Dcd, Exec, Mem, WB, each starting one cycle after the previous one; axes: Program Flow versus Time]
Single Memory is a Structural Hazard
[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) versus time in clock cycles; each instruction uses Mem, Reg, ALU, Mem, Reg, so with a single memory the Load's data access and a later instruction's fetch need Mem in the same cycle]
Detection is easy in this case! (right half highlight means read, left half write)
Control Hazard Solutions
• Stall: wait until decision is clear
– It’s possible to move up decision to 2nd stage by adding hardware to check registers as being read
• Impact: 2 clock cycles per branch instruction => slow
[Figure: instruction order (Add, Beq, Load) versus time in clock cycles; the Load after the Beq is delayed]
Control Hazard Solutions
• Predict: guess one direction then back up if wrong
– Predict not taken
• Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ≈ 50% of time)
• More dynamic scheme: history of 1 branch (≈ 90%)
[Figure: instruction order (Add, Beq, Load) versus time in clock cycles]
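The impact figures for prediction can be combined into an average cost per branch. A small sketch, assuming the costs quoted above (1 cycle when the guess is right, 2 when it is wrong); the function name is illustrative:

```python
def average_branch_cycles(correct_fraction, cycles_if_right=1, cycles_if_wrong=2):
    """Expected clock cycles per branch for a given prediction accuracy."""
    return (correct_fraction * cycles_if_right
            + (1 - correct_fraction) * cycles_if_wrong)


print(average_branch_cycles(0.5))            # predict not taken, right about half the time
print(round(average_branch_cycles(0.9), 2))  # history of 1 branch, right about 90%
```

Under these assumptions the static guess averages 1.5 cycles per branch and the one-branch history scheme about 1.1, compared with 2 cycles for always stalling.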
Control Hazard Solutions
• Redefine branch behavior (takes place after next instruction): “delayed branch”
• Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” (≈ 50% of time)
• As we launch more instructions per clock cycle, less useful
[Figure: instruction order (Add, Beq, Misc, Load) versus time in clock cycles; Misc occupies the slot after the Beq]
Data Hazard on r1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Data Hazard on r1:
• Dependencies backwards in time are hazards
[Figure: the five instructions above versus time in clock cycles, with stages IF, ID/RF, EX, MEM, WB (Im, Reg, ALU, Dm, Reg); add writes r1 in WB while the following instructions read r1 in ID/RF]
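The dependences in this sequence can be found mechanically. The sketch below is illustrative only: the tuple encoding of instructions and the function names are assumptions, and the hazard distance of 2 assumes a five-stage pipeline whose register file writes in the first half of a cycle and reads in the second half.

```python
# (mnemonic, destination register, source registers)
PROGRAM = [
    ("add", "r1", ["r2", "r3"]),
    ("sub", "r4", ["r1", "r3"]),
    ("and", "r6", ["r1", "r7"]),
    ("or", "r8", ["r1", "r9"]),
    ("xor", "r10", ["r1", "r11"]),
]


def raw_dependences(program):
    """Return (producer index, consumer index, register) for each read-after-write."""
    last_writer = {}
    deps = []
    for i, (_, dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        last_writer[dest] = i
    return deps


def is_hazard(producer, consumer, max_hazard_distance=2):
    """A dependence is a hazard when the consumer reads before the producer's WB."""
    return consumer - producer <= max_hazard_distance


for producer, consumer, reg in raw_dependences(PROGRAM):
    name = PROGRAM[consumer][0]
    print(name, reg, "hazard" if is_hazard(producer, consumer) else "ok")
```

Under that assumption sub and and are hazards, while or and xor read r1 late enough to see the written value.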
Data Hazard Solution:
• “Forward” result from one stage to another
• “or” OK if define read/write properly
[Figure: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 versus time in clock cycles, with stages IF, ID/RF, EX, MEM, WB; the result of add is forwarded to the instructions that follow]
Forwarding (or Bypassing): What about Loads
• Dependencies backwards in time are hazards
• Can’t solve with forwarding
• Must delay/stall instruction dependent on loads
lw r1,0(r2)
sub r4,r1,r3
[Figure: the two instructions above versus time in clock cycles, with stages IF, ID/RF, EX, MEM, WB; the loaded value is available only after MEM, after sub needs it in EX]
Designing a Pipelined Processor
• Go back and examine your datapath and control diagram
• associate resources with states
• ensure that flows do not conflict, or figure out how to resolve
• assert control in appropriate stage
Pipelined Processor (almost) for slides
• What happens if we start a new instruction every cycle?
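[Figure: pipelined datapath: PC, Next PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, M, Reg File, Equal; the control is pipelined alongside as Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl, with instruction registers IRex, IRmem, IRwb and a Valid bit]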
Control and Datapath
• Fetch: IR <– Mem[PC]; PC <– PC+4
• Decode: A <– R[rs]; B <– R[rt]
• R-type: S <– A + B; R[rd] <– S
• Load: S <– A + SX; M <– Mem[S]; R[rt] <– M
• ORi: S <– A or ZX; R[rt] <– S
• Store: S <– A + SX; Mem[S] <– B
• Branch: if Cond PC <– PC+SX
[Figure: datapath: PC, Next PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg File, Equal]
Pipelining the Load Instruction
• The five independent functional units in the pipeline datapath are:
–Instruction Memory for the Ifetch stage
–Register File’s Read ports (bus A and busB) for the Reg/Dec stage
–ALU for the Exec stage
–Data Memory for the Mem stage
–Register File’s Write port (bus W) for the Wr stage
[Figure: three lw instructions versus Clock, Cycles 1 to 7; the 1st lw occupies Ifetch, Reg/Dec, Exec, Mem, Wr in Cycles 1 to 5, the 2nd lw in Cycles 2 to 6, the 3rd lw in Cycles 3 to 7]
The Four Stages of R-type
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec:
– ALU operates on the two register operands
– Update PC
• Wr: Write the ALU output back to the register file
[Figure: R-type: Ifetch, Reg/Dec, Exec, Wr in Cycles 1 to 4]
Pipelining the R-type and Load Instruction
• We have pipeline conflict or structural hazard:
– Two instructions try to write to the register file at the same time!
– Only one write port
[Figure: R-type, R-type, Load, R-type, R-type issued in consecutive cycles, Cycles 1 to 9; the Load and the R-type issued right after it reach Wr in the same cycle]
Oops! We have a problem!
Important Observation
• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions:
– Load uses Register File’s Write Port during its 5th stage
– R-type uses Register File’s Write Port during its 4th stage
Load:    Ifetch  Reg/Dec  Exec  Mem  Wr   (stages 1 to 5)
R-type:  Ifetch  Reg/Dec  Exec  Wr        (stages 1 to 4)
• 2 ways to solve this pipeline hazard.
Solution 1: Insert “Bubble” into the Pipeline
• Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle
–The control logic can be complex.
–Lose instruction fetch and issue opportunity.
• No instruction is started in Cycle 6!
[Figure: R-type, R-type, Load, R-type, R-type versus Clock, Cycles 1 to 9, with a Pipeline Bubble inserted so that no two instructions reach Wr in the same cycle]
Solution 2: Delay R-type’s Write by One Cycle
• Delay R-type’s register write by one cycle:
– Now R-type instructions also use Reg File’s write port at Stage 5
– Mem stage is a NOOP stage: nothing is being done.
R-type:  Ifetch  Reg/Dec  Exec  Mem  Wr   (stages 1 to 5)
[Figure: R-type, R-type, Load, R-type, R-type versus Clock, Cycles 1 to 9; every instruction now has five stages and writes the register file in its 5th stage]
Modified Control & Datapath
• Fetch: IR <– Mem[PC]; PC <– PC+4
• Decode: A <– R[rs]; B <– R[rt]
• R-type: S <– A + B; M <– S; R[rd] <– M
• Load: S <– A + SX; M <– Mem[S]; R[rt] <– M
• ORi: S <– A or ZX; M <– S; R[rt] <– M
• Store: S <– A + SX; Mem[S] <– B
• Branch: if Cond PC <– PC+SX
[Figure: datapath: PC, Next PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg File, Equal]
The Four Stages of Store
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Write the data into the Data Memory
[Figure: Store: Ifetch, Reg/Dec, Exec, Mem in Cycles 1 to 4; the Wr slot is unused]
The Three Stages of Beq
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec:
– Registers Fetch and Instruction Decode
• Exec:
– compares the two register operands,
– select correct branch target address
– latch into PC
[Figure: Beq: Ifetch, Reg/Dec, Exec in Cycles 1 to 3; the Mem and Wr slots are unused]
Control Diagram
• Fetch: IR <– Mem[PC]; PC <– PC+4
• Decode: A <– R[rs]; B <– R[rt]
• R-type: S <– A + B; M <– S; R[rd] <– M
• Load: S <– A + SX; M <– Mem[S]; R[rt] <– M
• ORi: S <– A or ZX; M <– S; R[rt] <– M
• Store: S <– A + SX; Mem[S] <– B
• Branch: if Cond PC <– PC+SX
[Figure: datapath: PC, Next PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg File, Equal]
Datapath + Data Stationary Control
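[Figure: pipelined datapath (PC, Next PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg File) with a Decode block; the decoded fields (op, rs, rt, fun, im) and the control bits (ex, me, wb, rw, v) travel with the instruction, shrinking to (me, wb, rw, v) and then (wb, rw, v) as it passes Mem Ctrl and WB Ctrl]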
Let’s Try it Out
10 lw r1, r2(35)
14 addI r2, r2, 3
20 sub r3, r4, r5
24 beq r6, r7, 100
30 ori r8, r9, 17
34 add r10, r11, r12
100 and r13, r14, r15
these addresses are octal
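Which instruction sits in which stage during each cycle of this program can be tabulated with a few lines of code. This is a rough sketch: the fetch order below assumes the branch at 24 is taken and redirects fetch once it has passed Exec, it follows the program addresses in steps of 4 (octal), and it does not model squashing; the names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Assumed fetch order (octal addresses): sequential, then the branch target.
FETCH_ORDER = [0o10, 0o14, 0o20, 0o24, 0o30, 0o34, 0o100, 0o104, 0o110, 0o114]


def occupancy(fetch_order, cycle):
    """Map each stage to the address it holds in the given cycle (1-based)."""
    state = {}
    for depth, stage in enumerate(STAGES):
        k = cycle - depth  # position in fetch order of the instruction at this stage
        state[stage] = fetch_order[k - 1] if 1 <= k <= len(fetch_order) else None
    return state


for cycle in range(1, 8):
    row = occupancy(FETCH_ORDER, cycle)
    print(cycle, {s: (format(a, "o") if a is not None else "-") for s, a in row.items()})
```

Cycle 5 of this table holds 30 in IF, 24 in ID, 20 in EX, 14 in MEM, and 10 in WB.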
Start: Fetch 10
[Figure: datapath snapshot: PC = 10; no valid instruction yet in Decode, Exec, Mem, or WB (n n n n)]
Fetch 14, Decode 10
[Figure: datapath snapshot: PC = 14; IR = lw r1, r2(35), reading register 2; Exec, Mem, and WB hold no valid instruction (n n n)]
Fetch 20, Decode 14, Exec 10
[Figure: datapath snapshot: PC = 20; IR = addI r2, r2, 3; Exec holds lw r1 with A = r2 and immediate 35; Mem and WB hold no valid instruction (n n)]
Fetch 24, Decode 20, Exec 14, Mem 10
[Figure: datapath snapshot: PC = 24; IR = sub r3, r4, r5, reading registers 4 and 5; Exec holds addI r2, r2, 3 with A = r2 and immediate 3; Mem holds lw r1 with S = r2+35; WB holds no valid instruction (n)]
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Figure: datapath snapshot: PC = 30; IR = beq r6, r7, 100, reading registers 6 and 7; Exec holds sub r3 with A = r4, B = r5; Mem holds addI r2 with S = r2+3; WB holds lw r1 with M = M[r2+35]]
Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14
[Figure: datapath snapshot: PC = 34; IR = ori r8, r9, 17, reading register 9; Exec holds beq with A = r6, B = r7 and target 100; Mem holds sub r3 with S = r4-r5; WB holds addI r2 with M = r2+3; r1 = M[r2+35] is written to the register file]
Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20
[Figure: datapath snapshot: PC = 100; IR = add r10, r11, r12, reading registers 11 and 12; Exec holds ori r8 with A = r9 and immediate 17; Mem holds beq; WB holds sub r3 with M = r4-r5; register file: r1 = M[r2+35], r2 = r2+3]
Oops, we should have only one delayed instruction
Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
[Figure: datapath snapshot: PC = 104; IR = and r13, r14, r15, reading registers 14 and 15; Exec holds add r10 with A = r11, B = r12; Mem holds ori r8 with S = r9 | 17; WB holds beq; register file: r1 = M[r2+35], r2 = r2+3, r3 = r4-r5]
Squash the extra instruction in the branch shadow!
Fetch 110, Dcd 104, Ex 100, Mem 34, WB 30
[Figure: datapath snapshot: PC = 110; Exec holds and r13 with A = r14, B = r15; Mem holds add r10 with S = r11+r12; WB holds ori r8 with M = r9 | 17; register file: r1 = M[r2+35], r2 = r2+3, r3 = r4-r5]
Squash the extra instruction in the branch shadow!
Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34
[Figure: datapath snapshot: PC = 114; Mem holds and r13 with S = r14 & r15; WB holds the squashed add r10 with M = r11+r12, marked NO WB, NO Ovflow; register file: r1 = M[r2+35], r2 = r2+3, r3 = r4-r5, r8 = r9 | 17]
Squash the extra instruction in the branch shadow!
Summary: Pipelining
• What makes it easy
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
• What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
• We’ll build a simple pipeline and look at these issues
• We’ll talk about modern processors and what really makes it hard:
– exception handling
– trying to improve performance with out-of-order execution, etc.
Summary
• Pipelining is a fundamental concept
– multiple steps using distinct resources
• Utilize capabilities of the Datapath by pipelined instruction processing
–start next instruction while working on the current one
– limited by length of longest stage (plus fill/flush)
–detect and resolve hazards
What about Interrupts, Traps, Faults?
• External Interrupts:
–Allow pipeline to drain,
–Load PC with interrupt address
• Faults (within instruction, restartable)
–Force trap instruction into IF
–disable writes till trap hits WB
–must save multiple PCs or PC + state
Refer to MIPS solution
Exception Handling
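[Figure: pipelined datapath (PC, IAU, npc, I mem, Regs, A, B, im, alu, S, D mem, m, Regs) executing lw $2,20($5), with op, rw, n control fields carried along; the detection points are, in pipeline order:]
• detect bad instruction address
• detect bad instruction
• detect overflow
• detect bad data address
• Allow exception to take effect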
Exception Problem
• Exceptions/Interrupts: 5 instructions executing in 5 stage pipeline
– How to stop the pipeline?
– Restart?
– Who caused the interrupt?
Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic exception
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation; memory error
• Load with data page fault, Add with instruction page fault?
• Solution 1: interrupt vector/instruction , check last stage
• Solution 2: interrupt ASAP, restart everything incomplete
Resolution: Freeze above & Bubble Below
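[Figure: pipelined datapath (PC, IAU, npc, I mem, Regs, A, B, im, alu, S, D mem, m, Regs) with op, rw, n control fields in each stage; stages above the excepting instruction are frozen and a bubble is inserted below it]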
Memory
The Goal: illusion of large, fast, cheap memory
• Fact: Large memories are slow, fast memories are small
• How do we create a memory that is large, cheap and fast (most of the time)?
–Hierarchy
–Parallelism
An Expanded View of the Memory System
[Figure: Processor (Control, Datapath) followed by a chain of memories of increasing size. Speed: fastest to slowest. Size: smallest to biggest. Cost: highest to lowest]
Why hierarchy works
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference versus address space, 0 to 2^n - 1]
Memory Hierarchy: How Does it Work?
• Temporal Locality (Locality in Time):
=> Keep most recently accessed data items closer to the processor
• Spatial Locality (Locality in Space):
=> Move blocks consisting of contiguous words to the upper levels
[Figure: Upper Level Memory holding Blk X, with paths to and from the processor, and Lower Level Memory holding Blk Y]
Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: Time to access the upper level, which consists of RAM access time + Time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
• Hit Time << Miss Penalty
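These terms are commonly combined into an average memory access time, hit time plus miss rate times miss penalty. That formula is standard but is not stated on the slide, and the numbers in the sketch below are made up for illustration.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical values: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(average_access_time(hit_time=1, miss_rate=0.05, miss_penalty=100))
```

Because Hit Time << Miss Penalty, even a small miss rate dominates the average: here 5% of accesses contribute five of the six cycles.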
Memory Hierarchy of a Modern Computer System• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
Level                                         Speed (ns)                  Size (bytes)
Processor Registers (with Control, Datapath)  1s                          100s
On-Chip Cache, Second Level Cache (SRAM)      10s                         Ks
Main Memory (DRAM)                            100s                        Ms
Secondary Storage (Disk)                      10,000,000s (10s ms)        Gs
Tertiary Storage (Disk)                       10,000,000,000s (10s sec)   Ts
How is the hierarchy managed?
• Registers <-> Memory
– by compiler (programmer?)
• cache <-> memory
– by the hardware
• memory <-> disks
– by the hardware and operating system (virtual memory)
– by the programmer (files)
Example: 1 KB Direct Mapped Cache with 32 B Blocks
• For a 2 ** N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2 ** M)
[Figure: 32-bit address split into Cache Tag (bits 31 to 10, example: 0x50), Cache Index (bits 9 to 5, ex: 0x01), and Byte Select (bits 4 to 0, ex: 0x00). The cache has 32 entries (index 0 to 31), each with a Valid Bit, a Cache Tag stored as part of the cache “state”, and 32 bytes of Cache Data: Byte 0 to Byte 31, Byte 32 to Byte 63, ..., Byte 992 to Byte 1023]
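The address split for this example can be written out directly. A minimal sketch, assuming a 32-bit address, N = 10 and M = 5; the function name and defaults are illustrative.

```python
def split_address(address, cache_bytes=1024, block_bytes=32):
    """Split an address into (tag, index, byte select) for a direct mapped cache."""
    offset_bits = block_bytes.bit_length() - 1                   # M = 5 for 32 B blocks
    index_bits = (cache_bytes // block_bytes).bit_length() - 1   # 5 bits for 32 entries
    byte_select = address & (block_bytes - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, byte_select


# Address built from the example fields: tag 0x50, index 0x01, byte select 0x00.
address = (0x50 << 10) | (0x01 << 5) | 0x00
print(hex(address), [hex(field) for field in split_address(address)])
```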
Extreme Example: single big line
• Cache Size = 4 bytes, Block Size = 4 bytes
– Only ONE entry in the cache
• If an item is accessed, likely that it will be accessed again soon
– But it is unlikely that it will be accessed again immediately!!!
– The next access is likely to be a miss again
• Continually loading data into the cache but discarding (forcing out) them before they are used again
• Worst nightmare of a cache designer: Ping Pong Effect
• Conflict Misses are misses caused by:
– Different memory locations mapped to the same cache index
• Solution 1: make the cache size bigger
• Solution 2: Multiple entries for the same Cache Index
[Figure: a single cache entry (index 0) with Valid Bit, Cache Tag, and Cache Data Byte 0 to Byte 3]
Another Extreme Example: Fully Associative
• Fully Associative Cache
– Forget about the Cache Index
– Compare the Cache Tags of all cache entries in parallel
– Example: with 32 B blocks, we need N 27-bit comparators
• By definition: Conflict Miss = 0 for a fully associative cache
[Figure: address split into Cache Tag (27 bits long, bits 31 to 5) and Byte Select (bits 4 to 0, ex: 0x01); the tag of every entry is compared (X) in parallel; each entry has a Valid Bit and 32 bytes of Cache Data (Byte 0 to Byte 31, Byte 32 to Byte 63, ...)]
A Two-way Set Associative Cache
• N-way set associative: N entries for each Cache Index
– N direct mapped caches operate in parallel
• Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the tag result
[Figure: two banks, each with Valid, Cache Tag, and Cache Data (Cache Block 0, ...), both addressed by the Cache Index; each bank's tag is compared with the Adr Tag; the two Compare outputs drive Sel1 and Sel0 of a Mux that selects the Cache Block, and are ORed to form Hit]
Disadvantage of Set Associative Cache
• N-way Set Associative Cache versus Direct Mapped Cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss decision and set selection
• In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
– Possible to assume a hit and continue. Recover later if miss.
[Figure: two-way set associative cache: two banks of Valid, Cache Tag, and Cache Data addressed by the Cache Index, two Compare units against the Adr Tag, a Mux (Sel1, Sel0) selecting the Cache Block, and an OR producing Hit]
A Summary on Sources of Cache Misses
• Compulsory (cold start or process migration, first reference): first access to a block
– “Cold” fact of life: not a whole lot you can do about it
– Note: If you are going to run “billions” of instructions, Compulsory Misses are insignificant
• Conflict (collision):
– Multiple memory locations mapped to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity
• Capacity:
– Cache cannot contain all blocks accessed by the program
– Solution: increase cache size
• Invalidation: other process (e.g., I/O) updates memory
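Conflict misses and the effect of associativity can be seen in a tiny simulation. This is a rough sketch rather than a model of any particular cache: it tracks block numbers only, uses LRU replacement within a set, and the trace is a made-up ping-pong pattern between two blocks that share an index in the direct mapped case.

```python
from collections import OrderedDict


def count_misses(block_trace, num_blocks, ways):
    """Count misses for a cache of num_blocks blocks with the given associativity."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_trace:
        entries = sets[block % num_sets]
        if block in entries:
            entries.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            if len(entries) == ways:
                entries.popitem(last=False)   # evict the least recently used block
            entries[block] = True
    return misses


trace = [0, 8, 0, 8, 0, 8]  # blocks 0 and 8 collide in an 8-block direct mapped cache
print(count_misses(trace, num_blocks=8, ways=1))  # 6: every access is a miss
print(count_misses(trace, num_blocks=8, ways=2))  # 2: only the compulsory misses
print(count_misses(trace, num_blocks=8, ways=8))  # 2: fully associative
```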
Improving Cache Performance: 3 general options
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
4 Questions for Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
Q1: Where can a block be placed in the upper level?
• Block 12 placed in 8 block cache:
– Fully associative, direct mapped, 2-way set associative
– S.A. Mapping = Block Number Modulo Number Sets
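The mapping rule covers all three organizations if direct mapped is treated as 1-way and fully associative as 8-way. A short sketch for block 12 in an 8-block cache; the function name and the numbering of frames within a set are illustrative assumptions.

```python
def candidate_frames(block_number, num_blocks, ways):
    """Return the cache frames in which a block may be placed."""
    num_sets = num_blocks // ways
    set_index = block_number % num_sets          # Block Number Modulo Number Sets
    first = set_index * ways
    return list(range(first, first + ways))


print(candidate_frames(12, 8, ways=1))  # direct mapped: [4]
print(candidate_frames(12, 8, ways=2))  # 2-way set associative: set 0 -> [0, 1]
print(candidate_frames(12, 8, ways=8))  # fully associative: any of the 8 frames
```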
Q2: How is a block found if it is in the upper level?
• Tag on each block
– No need to check index or block offset
• Increasing associativity shrinks index, expands tag
Q3: Which block should be replaced on a miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
– Random
– LRU (Least Recently Used)
Miss rates by cache size and associativity:
Associativity:   2-way            4-way            8-way
Size             LRU     Random   LRU     Random   LRU     Random
16 KB            5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB            1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB           1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
Q4: What happens on a write?
• Write through—The information is written to both the block in the cache and to the block in the lower-level memory.
• Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
– is block clean or dirty?
• Pros and Cons of each?
– WT: read misses cannot result in writes
– WB: repeated writes to the same block are not each written to memory
• WT always combined with write buffers so that the processor doesn’t wait for lower level memory
Write Buffer for Write Through
• A Write Buffer is needed between the Cache and Memory
–Processor: writes data into the cache and the write buffer
–Memory controller: write contents of the buffer to memory
• Write buffer is just a FIFO:–Typical number of entries: 4
–Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
[Figure: Processor, Cache, Write Buffer, DRAM; the write buffer sits between the cache and DRAM]
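The condition "store frequency << 1 / DRAM write cycle" can be explored with a small FIFO model. This is a simplified sketch with assumed parameters: stores are issued at a fixed interval, DRAM retires one write at a time, and the processor stalls until the oldest entry drains when the buffer is full. The function name `stalled_stores` is hypothetical.

```python
from collections import deque


def stalled_stores(num_stores, store_interval, dram_write_cycle, entries=4):
    """Count stores that find the write buffer full and must wait."""
    pending = deque()        # completion times of buffered writes, oldest first
    dram_free_at = 0         # time at which DRAM can start the next write
    now = 0
    stalls = 0
    for _ in range(num_stores):
        now += store_interval
        while pending and pending[0] <= now:
            pending.popleft()                  # these writes already finished
        if len(pending) == entries:
            stalls += 1
            now = pending.popleft()            # wait for the oldest to drain
        dram_free_at = max(now, dram_free_at) + dram_write_cycle
        pending.append(dram_free_at)
    return stalls


print(stalled_stores(100, store_interval=10, dram_write_cycle=2))
print(stalled_stores(100, store_interval=1, dram_write_cycle=10))
```

When stores arrive more slowly than DRAM can retire them, the buffer never fills. When they arrive faster, a 4-entry buffer only absorbs the first few stores and every later store stalls, regardless of the buffer depth.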
Write-miss Policy: Write Allocate versus Not Allocate
• Assume: a 16-bit write to memory location 0x0 causes a miss
– Do we read in the block?
• Yes: Write Allocate
• No: Write Not Allocate
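[Figure: direct-mapped cache holding Byte 0 through Byte 1023 in 32-byte blocks. The 32-bit address splits into Cache Tag (bits 31-10, example: 0x00), Cache Index (bits 9-5, ex: 0x00), and Byte Select (bits 4-0, ex: 0x00). Each cache entry (index 0, 1, 2, 3, ...) holds a Valid Bit, a Cache Tag, and Cache Data]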
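The effect of the write-miss policy shows up on the access that follows the write. The sketch below assumes a 1 KB direct-mapped cache with 32-byte blocks and a hypothetical helper `run`; it tracks tags only, not data.

```python
BLOCK_BYTES = 32
NUM_BLOCKS = 32            # 1 KB cache / 32-byte blocks (assumed configuration)


def run(trace, write_allocate):
    """Return "hit" or "miss" for each (op, address) in the trace."""
    tags = [None] * NUM_BLOCKS
    results = []
    for op, addr in trace:
        block = addr // BLOCK_BYTES
        index = block % NUM_BLOCKS
        tag = block // NUM_BLOCKS
        hit = tags[index] == tag
        results.append("hit" if hit else "miss")
        # Read misses always bring the block in; write misses only do so
        # under the write-allocate policy.
        if not hit and (op == "R" or write_allocate):
            tags[index] = tag
    return results


trace = [("W", 0x0), ("R", 0x4)]      # write to 0x0, then read the same block
print(run(trace, write_allocate=True))
print(run(trace, write_allocate=False))
```

With write allocate the block is read in on the write miss, so the later read hits. Without it the write bypasses the cache and the read misses as well.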
Recall: Levels of the Memory Hierarchy
Level           Capacity     Access Time   Cost
CPU Registers   100s Bytes   <10s ns
Cache           K Bytes      10-100 ns     $.01-.001/bit
Main Memory     M Bytes      100ns-1us     $.01-.001
Disk            G Bytes      ms            10^-3 - 10^-4 cents
Tape            infinite     sec-min       10^-6
Staging / Xfer Unit between adjacent levels:
Registers <-> Cache:   Instr. Operands, managed by prog./compiler, 1-8 bytes
Cache <-> Memory:      Blocks, managed by cache cntl, 8-128 bytes
Memory <-> Disk:       Pages, managed by OS, 512-4K bytes
Disk <-> Tape:         Files, managed by user/operator, Mbytes
Upper Level: faster. Lower Level: larger.
Basic Issues in Virtual Memory System Design
• Size of information blocks that are transferred from secondary to main storage (M)
• When a block of information is brought into M and M is full, some region of M must be released to make room for the new block --> replacement policy
• Which region of M is to hold the new block --> placement policy
• Missing item is fetched from secondary memory only on the occurrence of a fault --> demand load policy
Paging Organization
• Virtual and physical address space partitioned into blocks of equal size: pages (virtual) and page frames (physical)
[Figure: hierarchy of reg, cache, mem, disk, with labels "pages" and "frame"]
Address Map
V = {0, 1, . . . , n - 1} virtual address space
M = {0, 1, . . . , m - 1} physical address space
MAP: V --> M U {0} address mapping function
n > m
MAP(a) = a' if data at virtual address a is present in physical address a' and a' in M
       = 0  if data at virtual address a is not present in M
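[Figure: the Processor issues address a in Name Space V to the Addr Trans Mechanism, which either produces physical address a' for Main Memory or returns 0 and raises a missing item fault; the fault handler brings the item from Secondary Memory into Main Memory (the OS performs this transfer)]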
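The mapping function and the demand-load fault path can be sketched as follows. The names `access`, `mapping`, and `free_physical` are hypothetical, and membership in the dictionary stands in for the "not present" result, because 0 is itself a valid physical address in M. Replacement when M is full is not modeled.

```python
def access(a, mapping, free_physical):
    """Return (physical address, faulted) for virtual address a.

    mapping       : dict from virtual address to physical address (MAP)
    free_physical : list of unused physical locations (assumed non-empty
                    whenever a fault occurs)
    """
    if a in mapping:                       # MAP(a) = a'
        return mapping[a], False
    # Missing item fault: the handler (the OS) picks a free location,
    # brings the item in from secondary memory, and updates the map.
    a_prime = free_physical.pop()
    mapping[a] = a_prime
    return a_prime, True


mapping = {5: 2}
free_physical = [7]
print(access(5, mapping, free_physical))   # already present
print(access(9, mapping, free_physical))   # faults and is loaded on demand
print(access(9, mapping, free_physical))   # present after the fault
```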
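Paging Organization
• Physical Memory: frames 0 through 7, 1K each, at physical addresses (P.A.) 0, 1024, ..., 7168
• Virtual Memory: pages 0 through 31, 1K each, at addresses 0, 1024, ..., 31744
• Addr Trans MAP maps pages to frames
• The page is the unit of mapping and also the unit of transfer from virtual to physical memory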
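Address Mapping
• V.A. = page no. | disp (10-bit displacement)
• Page Table Base Reg + page no. gives the index into the page table
• Each page table entry holds V (valid), Access Rights, and PA
• The page table is located in physical memory
• PA + disp gives the physical memory address (actually, concatenation is more likely)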
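The translation step can be written out for the 1K-page example. This is a minimal sketch: the page table is a plain Python list of (valid, frame) pairs, access rights are omitted, and the names `translate` and `PageFault` are hypothetical. It uses concatenation of the frame number and displacement rather than addition.

```python
PAGE_BITS = 10                     # 1K pages -> 10-bit displacement
PAGE_SIZE = 1 << PAGE_BITS


class PageFault(Exception):
    """Raised when the page table entry is not valid."""


def translate(va, page_table):
    """Translate a virtual address using a list of (valid, frame) entries."""
    page_no = va >> PAGE_BITS              # index into the page table
    disp = va & (PAGE_SIZE - 1)            # displacement within the page
    valid, frame = page_table[page_no]
    if not valid:
        raise PageFault(page_no)
    return (frame << PAGE_BITS) | disp     # concatenate frame number and disp


page_table = [(False, 0)] * 32             # 32 virtual pages, none resident
page_table[3] = (True, 5)                  # virtual page 3 lives in frame 5

print(translate(3 * 1024 + 17, page_table))   # 5 * 1024 + 17 = 5137
```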
Virtual Address and a Cache
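[Figure: CPU sends VA to Translation, which sends PA to the Cache; on a hit the cache returns data to the CPU, on a miss the access goes to Main Memory]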
It takes an extra memory access to translate VA to PA
This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible
ASIDE: Why access cache with PA at all? VA caches have a problem!
• Synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
• For update: must update all cache entries with the same physical address or memory becomes inconsistent
• Determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits; or
• Software-enforced alias boundary: VA and PA must have the same least-significant bits over a range larger than the cache size
TLBs
A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer or TLB.
TLB entry fields: Virtual Address | Physical Address | Dirty | Ref | Valid | Access
TLB access time comparable to cache access time (much less than main memory access time)
Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
TLBs are usually small, typically not more than 128 - 256 entries even on high end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.
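[Figure: Translation with a TLB. CPU sends VA to TLB Lookup; on a TLB hit the PA goes to the Cache, on a TLB miss the full Translation is performed first; a cache hit returns data to the CPU and a cache miss goes to Main Memory. Timing labels: 1/2 t, t, 20 t]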
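A TLB behaves like a small cache placed in front of the page table. The sketch below models a fully associative TLB with LRU replacement; the class name `TLB`, the dictionary page table, and the entry count are assumptions for illustration, and dirty, reference, and access-rights bits are left out.

```python
from collections import OrderedDict


class TLB:
    """Fully associative TLB with LRU replacement (illustrative model)."""

    def __init__(self, entries, page_table):
        self.entries = entries
        self.page_table = page_table       # dict: virtual page -> frame
        self.cache = OrderedDict()         # least recently used entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpage):
        if vpage in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpage)
        else:
            self.misses += 1               # slow path: walk the page table
            if len(self.cache) == self.entries:
                self.cache.popitem(last=False)
            self.cache[vpage] = self.page_table[vpage]
        return self.cache[vpage]


page_table = {p: p + 100 for p in range(8)}
tlb = TLB(entries=2, page_table=page_table)
for vpage in [0, 1, 0, 2, 0, 1]:
    tlb.lookup(vpage)
print(tlb.hits, tlb.misses)
```

Only the misses pay for the page table access; the hits are served at roughly cache speed, which is why even a small TLB pays off when references show locality.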
Summary #1 / 4:
• The Principle of Locality:
– Program likely to access a relatively small portion of the address space at any instant of time.
• Temporal Locality: Locality in Time
• Spatial Locality: Locality in Space
• Three Major Categories of Cache Misses:
– Compulsory Misses: sad facts of life. Example: cold start misses.
– Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect!
– Capacity Misses: increase cache size
• Cache Design Space
– total size, block size, associativity
– replacement policy
– write-hit policy (write-through, write-back)
– write-miss policy
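The ping pong effect mentioned under conflict misses can be reproduced with a short set-associative model. This is a sketch under assumptions: LRU replacement within each set, an 8-block cache, and a hypothetical function `count_misses` operating on block numbers.

```python
from collections import OrderedDict


def count_misses(block_trace, num_blocks, ways):
    """Count misses in a set-associative cache with LRU replacement."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_trace:
        resident = sets[block % num_sets]
        if block in resident:
            resident.move_to_end(block)
        else:
            misses += 1
            if len(resident) == ways:
                resident.popitem(last=False)    # evict least recently used
            resident[block] = True
    return misses


trace = [0, 8] * 10        # two blocks that map to the same direct-mapped frame
print(count_misses(trace, num_blocks=8, ways=1))
print(count_misses(trace, num_blocks=8, ways=2))
```

In the direct-mapped cache blocks 0 and 8 keep evicting each other, so all 20 accesses miss. With 2-way associativity both fit in the same set and only the 2 compulsory misses remain.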
Summary #2 / 4: The Cache Design Space
• Several interacting dimensions
– cache size
–block size
–associativity
– replacement policy
–write-through vs write-back
–write allocation
[Figure: the cache design space, with axes labeled Associativity, Cache Size, Block Size]
Summary #3 / 4: TLB, Virtual Memory
• Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions: 1) Where can block be placed? 2) How is block found? 3) What block is replaced on miss? 4) How are writes handled?
• Page tables map virtual address to physical address
• TLBs are important for fast translation
• TLB misses are significant in processor performance: (funny times, as most systems can't access all of 2nd level cache without TLB misses!)
Summary #4 / 4: Memory Hierarchy
• Today VM allows many processes to share single memory without having to swap all processes to disk; VM protection is more important than memory hierarchy
• Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?
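One way to make the last question concrete is to count cache misses for two loops that perform the same number of operations. The sketch below assumes a 1 KB direct-mapped cache with 32-byte blocks and a 64 x 64 matrix of 4-byte elements stored in row-major order; the function name `misses` and all sizes are illustrative assumptions.

```python
def misses(addresses, cache_bytes=1024, block_bytes=32):
    """Count misses for a byte-address trace in a direct-mapped cache."""
    num_blocks = cache_bytes // block_bytes
    resident = [None] * num_blocks         # block number held by each frame
    count = 0
    for addr in addresses:
        block = addr // block_bytes
        index = block % num_blocks
        if resident[index] != block:
            resident[index] = block
            count += 1
    return count


N, ELEM = 64, 4                            # 64 x 64 matrix of 4-byte elements
row_order = [(i * N + j) * ELEM for i in range(N) for j in range(N)]
col_order = [(i * N + j) * ELEM for j in range(N) for i in range(N)]

print("row-by-row misses:      ", misses(row_order))
print("column-by-column misses:", misses(col_order))
```

Both loops touch the same 4096 elements, but the row-by-row walk misses once per 32-byte block (512 misses) while the column-by-column walk misses on every access (4096 misses) in this configuration. The operation count is identical; only the access order, and therefore the miss count, differs.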