CSC 4250 Computer Architectures September 15, 2006 Appendix A. Pipelining

Jan 13, 2016


lecomte lecomte

Page 1: CSC 4250 Computer Architectures

CSC 4250 Computer Architectures

September 15, 2006

Appendix A. Pipelining

Page 2: CSC 4250 Computer Architectures

What is Pipelining?

Implementation technique whereby multiple instructions are overlapped in execution

Pipelining exploits parallelism among the instructions in a sequential instruction stream

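Recall the formula: CPU time = IC × CPI × cct, where IC is the instruction count, CPI is the cycles per instruction, and cct is the clock cycle time

Pipelining yields a reduction in the average execution time per instruction; i.e., it decreases the CPI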

Page 3: CSC 4250 Computer Architectures

RISC Architectures

Reduced Instruction Set Computer

All operations on data apply to data in registers

The only operations that affect memory are loads and stores, which move data from memory to a register or from a register to memory, respectively

Instruction formats are few in number, with all instructions typically the same size

Page 4: CSC 4250 Computer Architectures

Three Classes of Instructions

We consider:

ALU instructions

Load and store instructions

Branches (no jumps)

Page 5: CSC 4250 Computer Architectures

ALU Instructions

Take either two registers or a register and a sign-extended immediate, operate on them, and store result into a third register:

DADD R1,R2,R3

  Opcode | rs = R2 | rt = R3 | rd = R1 | shamt | opx

  Reg[R1] ← Reg[R2] + Reg[R3]

DADDI R1,R2,#3

  Opcode | rs = R2 | rt = R1 | Immediate

  Reg[R1] ← Reg[R2] + 3
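The register-transfer semantics of the two ALU forms can be sketched in Python. This is only an illustration, not part of the lecture: the `regs` list, the helper `sign_extend16`, and the function names are hypothetical, the immediate is assumed to be 16 bits wide, and 64-bit overflow is ignored.

```python
regs = [0] * 32  # hypothetical register file Reg[0..31]

def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def dadd(rd, rs, rt):
    """DADD rd,rs,rt: Reg[rd] <- Reg[rs] + Reg[rt] (overflow ignored here)."""
    regs[rd] = regs[rs] + regs[rt]

def daddi(rt, rs, imm16):
    """DADDI rt,rs,#imm: Reg[rt] <- Reg[rs] + sign-extended immediate."""
    regs[rt] = regs[rs] + sign_extend16(imm16)

regs[2], regs[3] = 10, 32
dadd(1, 2, 3)    # DADD R1,R2,R3
print(regs[1])   # 42
daddi(1, 2, 3)   # DADDI R1,R2,#3
print(regs[1])   # 13
```

Note that the register-register form writes the rd field while the immediate form writes the rt field, which matches the two formats above.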

Page 6: CSC 4250 Computer Architectures

Load and Store Instructions

Take register source (base register) and immediate field (offset). The sum (effective address) is memory address. Second register is destination (load) or source (store) of data.

LD R2,30(R1)

  Opcode | R1 | R2 | Immediate

  Reg[R2] ← Mem[30+Reg[R1]]

SD R2,30(R1)

  Opcode | R1 | R2 | Immediate

  Mem[30+Reg[R1]] ← Reg[R2]
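A minimal Python sketch of the effective-address calculation for loads and stores follows. It is an illustration under simplifying assumptions: memory is modelled as a hypothetical dictionary from address to value, alignment and access width are ignored, and the names `ld`, `sd`, `regs`, and `mem` are made up for this example.

```python
regs = [0] * 32  # hypothetical register file
mem = {}         # hypothetical sparse memory: address -> value

def ld(rt, offset, base):
    """LD rt,offset(base): Reg[rt] <- Mem[offset + Reg[base]]."""
    effective_address = offset + regs[base]
    regs[rt] = mem.get(effective_address, 0)  # unwritten locations read as 0

def sd(rt, offset, base):
    """SD rt,offset(base): Mem[offset + Reg[base]] <- Reg[rt]."""
    effective_address = offset + regs[base]
    mem[effective_address] = regs[rt]

regs[1] = 1000   # base register R1
regs[2] = 7      # data to store
sd(2, 30, 1)     # SD R2,30(R1): Mem[1030] <- 7
regs[2] = 0
ld(2, 30, 1)     # LD R2,30(R1): Reg[R2] <- Mem[1030]
print(regs[2])   # 7
```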

Page 7: CSC 4250 Computer Architectures

Branches

Branches are conditional transfers of control

Branch destination obtained by adding a sign-extended offset to current PC

We consider only comparison against zero:

BEQZ R1,name

BEQZ is pseudo-instruction for BEQ with R0: BEQ R1,R0,name

Opcode R1 R0 Immediate

Page 8: CSC 4250 Computer Architectures

RISC Instruction Set

At most five clock cycles:

1. Instruction fetch cycle (IF)

2. Instruction decode/register fetch cycle (ID)

3. Execution/effective address cycle (EX)

4. Memory access/branch completion (MEM)

5. Write-back cycle (WB)

Page 9: CSC 4250 Computer Architectures

Instruction Fetch (IF)

Send program counter (PC) to memory and fetch current instruction from memory;

Update PC by adding 4 (why 4?).

Operations:

IR ← Mem[PC];

NPC ← PC + 4;

Page 10: CSC 4250 Computer Architectures

Instruction Decode/Register Fetch (ID)

Decode instruction

Read registers

Decoding is done in parallel with reading registers (fixed-field decoding)

Sign-extend the offset field

Operations:

A ← Reg[rs];

B ← Reg[rt];

Imm ← sign-extended immediate field of IR

(A and B are temporary registers).

Page 11: CSC 4250 Computer Architectures

Execution/Effective Address (EX)

ALU operates on the operands prepared in ID, performing one of four possible functions:

Memory reference (add base register and offset):

  ALUOutput ← A + Imm

Register-Register ALU instruction:

  ALUOutput ← A func B

Register-Immediate ALU instruction:

  ALUOutput ← A op Imm

Branch:

  ALUOutput ← NPC + (Imm << 2)

  Cond ← (A == 0)

Page 12: CSC 4250 Computer Architectures

Memory Access/Branch Completion (MEM)

PC is updated:

  PC ← NPC

Access memory if needed (LMD = Load Memory Data register):

  LMD ← Mem[ALUOutput]

  or

  Mem[ALUOutput] ← B

Branch:

  If (cond) PC ← ALUOutput

Page 13: CSC 4250 Computer Architectures

Write Back (WB)

Register-Register ALU:

  Reg[rd] ← ALUOutput

Register-Immediate ALU:

  Reg[rt] ← ALUOutput

Load:

  Reg[rt] ← LMD
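The five steps (IF, ID, EX, MEM, WB) can be strung together into a small unpipelined interpreter. The Python sketch below is a rough illustration, not the lecture's implementation: instructions are represented as hypothetical tuples `(op, rs, rt, rd, imm)` instead of binary encodings, the immediate is assumed to be already sign-extended, and only DADD, DADDI, LD, SD, and BEQZ are handled.

```python
def step(pc, regs, imem, dmem):
    """Run one instruction through IF, ID, EX, MEM, WB; return the next PC."""
    # IF: fetch the instruction and compute the next sequential PC.
    ir = imem[pc]
    npc = pc + 4
    # ID: split the fields and read the two source registers.
    op, rs, rt, rd, imm = ir
    a, b = regs[rs], regs[rt]
    # EX: the ALU performs one of the four possible functions.
    cond = False
    if op in ("LD", "SD"):
        alu_output = a + imm            # effective address
    elif op == "DADD":
        alu_output = a + b              # register-register
    elif op == "DADDI":
        alu_output = a + imm            # register-immediate
    elif op == "BEQZ":
        alu_output = npc + (imm << 2)   # branch target
        cond = (a == 0)
    else:
        raise ValueError(f"unsupported opcode: {op}")
    # MEM: update the PC and access data memory if needed.
    next_pc = alu_output if (op == "BEQZ" and cond) else npc
    lmd = None
    if op == "LD":
        lmd = dmem.get(alu_output, 0)
    elif op == "SD":
        dmem[alu_output] = b
    # WB: write the result into the register file.
    if op == "DADD":
        regs[rd] = alu_output
    elif op == "DADDI":
        regs[rt] = alu_output
    elif op == "LD":
        regs[rt] = lmd
    return next_pc

# Hypothetical test program; addresses are byte addresses, 4 bytes apart.
imem = {
    0:  ("DADDI", 0, 1, 0, 5),    # R1 <- R0 + 5
    4:  ("SD",    0, 1, 0, 100),  # Mem[100 + R0] <- R1
    8:  ("LD",    0, 2, 0, 100),  # R2 <- Mem[100 + R0]
    12: ("BEQZ",  0, 0, 0, 1),    # R0 == 0, so branch to 16 + (1 << 2) = 20
    16: ("DADDI", 0, 3, 0, 99),   # skipped by the taken branch
    20: ("DADD",  1, 2, 3, 0),    # R3 <- R1 + R2
}
regs, dmem, pc = [0] * 32, {}, 0
while pc in imem:
    pc = step(pc, regs, imem, dmem)
print(regs[1], regs[2], regs[3])  # 5 5 10
```

Each instruction here finishes all five steps before the next one starts; pipelining overlaps these steps across consecutive instructions.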

Page 14: CSC 4250 Computer Architectures

Simple RISC Pipeline

             Clock Number
Instr. #     1   2   3   4   5   6   7   8   9
Instr. i     IF  ID  EX  ME  WB
Instr. i+1       IF  ID  EX  ME  WB
Instr. i+2           IF  ID  EX  ME  WB
Instr. i+3               IF  ID  EX  ME  WB
Instr. i+4                   IF  ID  EX  ME  WB

What are the stages needed for an ALU instruction?

What are the stages needed for a Store instruction?

What are the stages needed for a Branch instruction?

Which stage is expected to take the most time?
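The staggered pattern in the clock-number table can be generated programmatically. The following Python sketch assumes an ideal pipeline with no stalls; the function name `pipeline_diagram` and the list `STAGES` are hypothetical names chosen for this example.

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]

def pipeline_diagram(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in clock c+1."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters stage s in clock i + s + 1
        rows.append(row)
    return rows

for i, row in enumerate(pipeline_diagram(5)):
    label = "i" if i == 0 else f"i+{i}"
    print(f"Instr. {label:<4}" + " ".join(f"{cell:<2}" for cell in row))
```

With n instructions and 5 stages the diagram spans n + 4 clock cycles, so five instructions finish in nine cycles, as in the table.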

Page 15: CSC 4250 Computer Architectures

Figure A.2. Pipeline

Page 16: CSC 4250 Computer Architectures

Three Observations on Overlapping Execution

1. Use separate instruction and data memories, which are typically implemented as separate instruction and data caches. Separate caches eliminate the conflict over a single memory that would otherwise arise between instruction fetch and data memory access.

Page 17: CSC 4250 Computer Architectures

Three Observations on Overlapping Execution

2. The register file is used in two stages: for reading in ID and for writing in WB. These uses are distinct, so we need to perform two reads and one write every clock cycle (why two reads?). To handle reads and a write to the same register (and for another reason that will arise), we perform the register write in the first half of the clock cycle and the reads in the second half.
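The write-first, read-second ordering can be illustrated with a toy register-file model in Python. This is a sketch under stated assumptions, not a hardware description: the class `RegisterFile` and its `clock_cycle` method are hypothetical, and treating R0 as hard-wired to zero follows the MIPS convention rather than anything stated on the slide.

```python
class RegisterFile:
    """Toy model: one write (from WB) and two reads (from ID) per clock cycle."""

    def __init__(self, size=32):
        self.regs = [0] * size

    def clock_cycle(self, write=None, reads=()):
        """write is (reg, value) or None; reads is a tuple of register numbers."""
        # First half of the cycle: the instruction in WB writes its result.
        if write is not None:
            reg, value = write
            if reg != 0:  # assumption: R0 always reads as zero
                self.regs[reg] = value
        # Second half of the cycle: the instruction in ID reads its operands,
        # so it sees a value written to the same register in this same cycle.
        return tuple(self.regs[r] for r in reads)

rf = RegisterFile()
a, b = rf.clock_cycle(write=(1, 42), reads=(1, 2))
print(a, b)  # 42 0
```

If the reads happened before the write, the instruction in ID would see the stale value of R1 instead of 42.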

Page 18: CSC 4250 Computer Architectures

Three Observations on Overlapping Execution

3. To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. Another problem is that a branch does not change the PC until the MEM stage (this problem will be handled soon).

Page 19: CSC 4250 Computer Architectures

Pipeline Registers

Prevent interference between two different instructions in adjacent stages in pipeline.

Carry data of a given instruction from one stage to the next.

Registers are triggered by clock edge: values change instantaneously on clock edge.

Add pipelining overhead.

Page 20: CSC 4250 Computer Architectures

Figure A.3. Pipeline Registers

Page 21: CSC 4250 Computer Architectures

Example

Consider an unpipelined processor. Assume a 1 ns clock cycle, 4 cycles for ALU operations and branches, and 5 cycles for memory operations. Suppose the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Pipelining adds 0.2 ns of overhead to the clock cycle. What is the speedup from pipelining?

Page 22: CSC 4250 Computer Architectures

Answer

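Average instruction execution time on the unpipelined processor

= Clock cycle × Average CPI

= 1 ns × ((40%+20%)×4 + 40%×5)

= 4.4 ns

Average instruction execution time on the pipelined processor

= 1 ns + 0.2 ns overhead

= 1.2 ns

Speedup from pipelining

= 4.4 ns / 1.2 ns

≈ 3.7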
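The arithmetic can be checked with a few lines of Python. The variable names are made up for this sketch, and it assumes, as the answer does, that the pipelined processor completes one instruction per clock cycle with the cycle lengthened by the overhead.

```python
# (relative frequency, cycles) for ALU operations, branches, memory operations
instruction_mix = [(0.40, 4), (0.20, 4), (0.40, 5)]
clock_ns = 1.0
overhead_ns = 0.2

average_cpi = sum(freq * cycles for freq, cycles in instruction_mix)
unpipelined_ns = clock_ns * average_cpi   # average time per instruction
pipelined_ns = clock_ns + overhead_ns     # one instruction per (longer) cycle
speedup = unpipelined_ns / pipelined_ns

print(round(unpipelined_ns, 1))  # 4.4
print(round(speedup, 1))         # 3.7
```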