CSC 4250 Computer Architectures


September 15, 2006

Appendix A. Pipelining

What is Pipelining?

Implementation technique whereby multiple instructions are overlapped in execution

Pipelining exploits parallelism among the instructions in a sequential instruction stream

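Recall the formula: CPU time = IC × CPI × cct, where IC is the instruction count, CPI is the average number of clock cycles per instruction, and cct is the clock cycle time

Pipelining yields a reduction in the average execution time per instruction; i.e., it decreases the CPI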
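As a rough illustration of the formula, the sketch below evaluates CPU time for two values of CPI. The function name cpu_time and all of the numbers are made up for illustration; they are not taken from the course.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time_ns):
    """CPU time = IC x CPI x cct, returned in nanoseconds."""
    return instruction_count * cpi * clock_cycle_time_ns


# Hypothetical workload: 1000 instructions with a 1 ns clock cycle.
before = cpu_time(1000, cpi=4.0, clock_cycle_time_ns=1.0)
after = cpu_time(1000, cpi=1.0, clock_cycle_time_ns=1.0)
print(before, after)  # 4000.0 1000.0
```

With IC and cct held fixed, CPU time scales directly with CPI.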

RISC Architectures

Reduced Instruction Set Computer

All operations on data apply to data in registers

Only operations that affect memory are loads and stores that move data from memory to register or to memory from register, respectively

Instruction formats are few in number with all instructions typically the same in size

Three Classes of Instructions

We consider:

ALU instructions

Load and store instructions

Branches (no jumps)

ALU Instructions

Take either two registers or a register and a sign-extended immediate, operate on them, and store result into a third register:

DADD R1,R2,R3

Format: Opcode | rs = R2 | rt = R3 | rd = R1 | shamt | opx

Reg[R1] ← Reg[R2] + Reg[R3]

DADDI R1,R2,#3

Format: Opcode | rs = R2 | rt = R1 | Immediate

Reg[R1] ← Reg[R2] + 3
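The two register-transfer statements above can be sketched in Python. This is only an illustration: the 64-bit register width, the 16-bit immediate width, and the helper names (sign_extend, dadd, daddi) are assumptions, since the slides do not state field widths.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit registers


def sign_extend(value, bits=16):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def dadd(reg, rd, rs, rt):
    """DADD rd,rs,rt: Reg[rd] <- Reg[rs] + Reg[rt]."""
    reg[rd] = (reg[rs] + reg[rt]) & MASK64


def daddi(reg, rt, rs, imm16):
    """DADDI rt,rs,#imm: Reg[rt] <- Reg[rs] + sign-extended immediate."""
    reg[rt] = (reg[rs] + sign_extend(imm16)) & MASK64


reg = [0] * 32
reg[2], reg[3] = 10, 32
dadd(reg, 1, 2, 3)    # DADD  R1,R2,R3
daddi(reg, 4, 2, 3)   # DADDI R4,R2,#3
print(reg[1], reg[4])  # 42 13
```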

Load and Store Instructions

Take register source (base register) and immediate field (offset). The sum (effective address) is memory address. Second register is destination (load) or source (store) of data.

LD R2,30(R1)

Format: Opcode | R1 | R2 | Immediate

Reg[R2] ← Mem[30+Reg[R1]]

SD R2,30(R1)

Format: Opcode | R1 | R2 | Immediate

Mem[30+Reg[R1]] ← Reg[R2]
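A minimal sketch of the effective-address calculation for loads and stores follows. Memory is modeled as a Python dict keyed by address, and the helper names ld and sd are hypothetical; real hardware would also deal with alignment and access width, which are ignored here.

```python
def ld(reg, mem, rt, offset, base):
    """LD rt,offset(base): Reg[rt] <- Mem[offset + Reg[base]]."""
    # Unwritten locations read as 0 (an assumption of this sketch).
    reg[rt] = mem.get(offset + reg[base], 0)


def sd(reg, mem, rt, offset, base):
    """SD rt,offset(base): Mem[offset + Reg[base]] <- Reg[rt]."""
    mem[offset + reg[base]] = reg[rt]


reg = [0] * 32
mem = {}
reg[1] = 1000            # base register
reg[2] = 7               # value to store
sd(reg, mem, 2, 30, 1)   # SD R2,30(R1)
ld(reg, mem, 3, 30, 1)   # LD R3,30(R1)
print(mem[1030], reg[3])  # 7 7
```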

Branches

Branches are conditional transfers of control

Branch destination obtained by adding a sign-extended offset to current PC

We consider only comparison against zero:

BEQZ R1,name

BEQZ is pseudo-instruction for BEQ with R0: BEQ R1,R0,name

Format: Opcode | R1 | R0 | Immediate
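The branch-target calculation can be sketched as below. The sketch follows the EX-stage formula used later in these slides (the offset is shifted left by 2 and added to PC + 4); the 16-bit offset width and the function names are assumptions.

```python
def sign_extend(value, bits=16):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def beqz(reg, rs, pc, offset16):
    """Return the next PC for BEQZ rs,target."""
    npc = pc + 4                                   # fall-through address
    if reg[rs] == 0:                               # comparison against zero
        return npc + (sign_extend(offset16) << 2)  # taken: offset counts instructions
    return npc                                     # not taken


reg = [0] * 32
print(beqz(reg, 1, 100, 3))   # 116  (R1 == 0, branch taken)
reg[1] = 5
print(beqz(reg, 1, 100, 3))   # 104  (R1 != 0, fall through)
```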

RISC Instruction Set

At most five clock cycles:

1. Instruction fetch cycle (IF)

2. Instruction decode/register fetch cycle (ID)

3. Execution/effective address cycle (EX)

4. Memory access/branch completion (MEM)

5. Write-back cycle (WB)

Instruction Fetch (IF)

Send program counter (PC) to memory and fetch current instruction from memory

Update PC by adding 4 (why 4?)

Operations:

IR ← Mem[PC];

NPC ← PC + 4;
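The two IF operations can be sketched as a small function. Instruction memory is modeled as a dict keyed by byte address, and the instruction words are placeholders; the sketch assumes 4-byte instructions and byte addressing.

```python
def instruction_fetch(pc, instr_mem):
    """Return (IR, NPC) for the instruction at address pc."""
    ir = instr_mem[pc]   # IR  <- Mem[PC]
    npc = pc + 4         # NPC <- PC + 4 (next sequential instruction)
    return ir, npc


instr_mem = {0: 0x11111111, 4: 0x22222222}  # placeholder instruction words
ir, npc = instruction_fetch(0, instr_mem)
print(hex(ir), npc)  # 0x11111111 4
```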

Instruction Decode/Register Fetch (ID)

Decode instruction

Read registers

Decoding is done in parallel with reading registers (fixed-field decoding)

Sign-extend the offset field

Operations:

A ← Reg[rs];

B ← Reg[rt];

Imm ← sign-extended immediate field of IR

(A and B are temporary registers.)
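Fixed-field decoding can be sketched as bit slicing on the instruction word. The field positions below follow the usual 32-bit MIPS layout (6-bit opcode, 5-bit rs, rt, rd, 16-bit immediate); the slides do not give bit positions, so treat the layout, the opcode value, and the function names as assumptions.

```python
def sign_extend(value, bits=16):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def decode(ir):
    """Slice a 32-bit instruction word into its fixed fields."""
    return {
        "opcode": (ir >> 26) & 0x3F,
        "rs": (ir >> 21) & 0x1F,
        "rt": (ir >> 16) & 0x1F,
        "rd": (ir >> 11) & 0x1F,
        "imm": sign_extend(ir & 0xFFFF),
    }


def instruction_decode(ir, reg):
    """Return (A, B, Imm): both register reads happen regardless of opcode."""
    f = decode(ir)
    a = reg[f["rs"]]   # A   <- Reg[rs]
    b = reg[f["rt"]]   # B   <- Reg[rt]
    return a, b, f["imm"]


reg = [0] * 32
reg[2] = 10
# Placeholder opcode 0x18, rs = 2, rt = 1, immediate = -4 (0xFFFC).
ir = (0x18 << 26) | (2 << 21) | (1 << 16) | 0xFFFC
print(instruction_decode(ir, reg))  # (10, 0, -4)
```

Because rs and rt always sit in the same bit positions, the register reads do not have to wait for the opcode to be decoded.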

Execution/Effective Address (EX)

ALU operates on the operands prepared in ID, performing one of four possible functions:

Memory reference (add base register and offset):

ALUOutput ← A + Imm

Register-Register ALU instruction:

ALUOutput ← A func B

Register-Immediate ALU instruction:

ALUOutput ← A op Imm

Branch:

ALUOutput ← NPC + (Imm << 2)

Cond ← (A == 0)
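The four EX cases can be sketched as one function that selects on the instruction kind. The kind strings ("mem", "rr", "ri", "branch") and the function name execute are labels invented for this sketch; the ALU function defaults to addition.

```python
import operator


def execute(kind, a, b, imm, npc, func=operator.add):
    """Return (ALUOutput, Cond) for one instruction; Cond is None for non-branches."""
    if kind == "mem":                       # memory reference
        return a + imm, None                # ALUOutput <- A + Imm
    if kind == "rr":                        # register-register ALU
        return func(a, b), None             # ALUOutput <- A func B
    if kind == "ri":                        # register-immediate ALU
        return func(a, imm), None           # ALUOutput <- A op Imm
    if kind == "branch":
        return npc + (imm << 2), a == 0     # target address and Cond <- (A == 0)
    raise ValueError(f"unknown instruction kind: {kind}")


print(execute("mem", a=1000, b=0, imm=30, npc=4))     # (1030, None)
print(execute("rr", a=10, b=32, imm=0, npc=4))        # (42, None)
print(execute("branch", a=0, b=0, imm=3, npc=4))      # (16, True)
```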

Memory Access/Branch Completion (MEM)

PC is updated:

PC ← NPC

Access memory if needed (LMD = Load Memory Data register):

LMD ← Mem[ALUOutput]

or

Mem[ALUOutput] ← B

Branch:

if (Cond) PC ← ALUOutput
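A sketch of the MEM-stage operations follows. Data memory is a dict keyed by address, and the kind strings and function name are assumptions of this sketch rather than anything defined in the slides.

```python
def memory_access(kind, npc, alu_output, b, cond, data_mem):
    """Return (new PC, LMD); LMD is None unless the instruction is a load."""
    pc = npc                                  # PC <- NPC
    lmd = None
    if kind == "load":
        lmd = data_mem.get(alu_output, 0)     # LMD <- Mem[ALUOutput]
    elif kind == "store":
        data_mem[alu_output] = b              # Mem[ALUOutput] <- B
    elif kind == "branch" and cond:
        pc = alu_output                       # if (Cond) PC <- ALUOutput
    return pc, lmd


data_mem = {1030: 7}
print(memory_access("load", 8, 1030, 0, False, data_mem))   # (8, 7)
print(memory_access("branch", 8, 100, 0, True, data_mem))   # (100, None)
```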

Write Back (WB)

Register-Register ALU:

Reg[rd] ← ALUOutput

Register-Immediate ALU:

Reg[rt] ← ALUOutput

Load:

Reg[rt] ← LMD
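The three write-back cases can be sketched as follows. The kind strings and the function name write_back are hypothetical; stores and branches fall through without writing a register.

```python
def write_back(kind, reg, rd, rt, alu_output=None, lmd=None):
    """Write the result of one instruction into the register file."""
    if kind == "rr":
        reg[rd] = alu_output    # Reg[rd] <- ALUOutput
    elif kind == "ri":
        reg[rt] = alu_output    # Reg[rt] <- ALUOutput
    elif kind == "load":
        reg[rt] = lmd           # Reg[rt] <- LMD
    # Stores and branches write nothing back.


reg = [0] * 32
write_back("rr", reg, rd=1, rt=3, alu_output=42)
write_back("load", reg, rd=0, rt=2, lmd=7)
print(reg[1], reg[2])  # 42 7
```

Note that the destination field differs by instruction type: rd for register-register ALU instructions, rt for register-immediate ALU instructions and loads.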

Simple RISC Pipeline

             Clock Number
Instr. #     1    2    3    4    5    6    7    8    9
Instr. i     IF   ID   EX   ME   WB
Instr. i+1        IF   ID   EX   ME   WB
Instr. i+2             IF   ID   EX   ME   WB
Instr. i+3                  IF   ID   EX   ME   WB
Instr. i+4                       IF   ID   EX   ME   WB
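The staggered pattern in the table can be generated programmatically. The sketch below assumes an ideal pipeline with no stalls, so instruction k enters IF in clock cycle k + 1; the function name pipeline_diagram is made up for this example.

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]


def pipeline_diagram(n_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage       # instruction i is one cycle behind instruction i - 1
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_diagram(5)):
    label = "i" if i == 0 else f"i+{i}"
    print(f"{label:<5}" + " ".join(f"{cell:<2}" for cell in row))
```

Five instructions finish in 5 + 5 - 1 = 9 clock cycles, matching the nine columns of the table.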

What are the stages needed for an ALU instruction?

What are the stages needed for a Store instruction?

What are the stages needed for a Branch instruction?

Which stage is expected to take the most time?

Figure A.2. Pipeline

Three Observations on Overlapping Execution

1. Use separate instruction and data memories, typically implemented as separate instruction and data caches. The use of separate caches eliminates a conflict for a single memory that would arise between instruction fetch and data memory access.


2. The register file is used in two stages: one for reading in ID and one for writing in WB. These uses are distinct. Hence, we need to perform two reads and one write every clock cycle (why two reads?). To handle reads and a write to the same register (and for another reason that will arise), we perform the register write in the first half of the clock cycle and the reads in the second half.


3. To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. Another problem is that a branch does not change the PC until the MEM stage (this problem will be handled soon).
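The second observation (write in the first half of the cycle, read in the second half) can be sketched with a toy register file. The class and method names are hypothetical, and a single method call stands in for one clock cycle.

```python
class RegisterFile:
    """Toy register file: one write and up to two reads per clock cycle."""

    def __init__(self, n_regs=32):
        self.regs = [0] * n_regs

    def clock_cycle(self, write=None, reads=()):
        # First half of the cycle: the instruction in WB writes its result.
        if write is not None:
            index, value = write
            self.regs[index] = value
        # Second half of the cycle: the instruction in ID reads its operands,
        # so it already sees the value written in this same cycle.
        return tuple(self.regs[r] for r in reads)


rf = RegisterFile()
print(rf.clock_cycle(write=(1, 42), reads=(1, 2)))  # (42, 0)
```

If the order were reversed, the instruction in ID would read the stale value of R1 in this cycle.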

Pipeline Registers

Prevent interference between two different instructions in adjacent stages in pipeline.

Carry data of a given instruction from one stage to the next.

Registers are triggered by clock edge; values change instantaneously on clock edge.

Add pipelining overhead.

Figure A.3. Pipeline Registers

Example

Consider an unpipelined processor. Assume a 1 ns clock cycle, 4 cycles for ALU operations and branches, and 5 cycles for memory operations. Suppose the relative frequencies of ALU operations, branches, and memory operations are 40%, 20%, and 40%, respectively. Pipelining adds 0.2 ns of overhead to the clock cycle. What is the speedup from pipelining?

Answer

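Average execution time per instruction on unpipelined processor

= Clock cycle × Average CPI

= 1 ns × ((40%+20%)×4 + 40%×5)

= 4.4 ns

On the pipelined processor, the clock runs at the speed of the slowest stage plus the overhead, i.e., 1 ns + 0.2 ns = 1.2 ns. Ideally one instruction completes every clock cycle, so the average execution time per instruction is 1.2 ns.

Speedup from pipelining

= 4.4 ns / 1.2 ns

≈ 3.7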
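The arithmetic of this example can be checked with a few lines of Python. The variable names are made up for the sketch, and the pipelined processor is assumed to complete one instruction per clock cycle (no stalls).

```python
def average_cpi(mix):
    """mix is a list of (relative frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


clock_ns = 1.0
overhead_ns = 0.2
mix = [(0.4, 4), (0.2, 4), (0.4, 5)]   # ALU operations, branches, memory operations

unpipelined_ns = clock_ns * average_cpi(mix)   # average time per instruction
pipelined_ns = clock_ns + overhead_ns          # one instruction per (longer) cycle
speedup = unpipelined_ns / pipelined_ns
print(round(unpipelined_ns, 1), round(speedup, 1))  # 4.4 3.7
```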
