15-447 Computer Architecture, Fall 2008 ©

September 24, 2008

Nael Abu-Ghazaleh [email protected]

www.qatar.cmu.edu/~msakr/15447-f08/

CS-447 – Computer Architecture

Lecture 12: Multiple Cycle Datapath


Implementation vs. Performance

Performance of a processor is determined by

• Instruction count of a program

• CPI

• Clock cycle time (clock rate)

The compiler & the ISA determine the instruction count.

The implementation of the processor determines the CPI and the clock cycle time.


Possible Execution Steps of Any Instruction

° Instruction Fetch

° Instruction Decode and Register Fetch

° Execution of the Memory Reference Instruction

° Execution of Arithmetic-Logical operations

° Branch Instruction

° Jump Instruction


Instruction Processing

° Five steps:

• Instruction fetch (IF)

• Instruction decode and operand fetch (ID)

• ALU/execute (EX)

• Memory access (MEM), not required by every instruction

• Write-back (WB)

[Figure: datapath overview showing the PC, instruction memory, registers, ALU, and data memory, annotated with the IF, ID, EX, MEM, and WB steps]


Single Cycle Implementation

[Figure: single cycle datapath with the PC, instruction memory, register file, sign extend (16 to 32 bits), shift left 2, two adders, ALU, and data memory; control signals RegWrite, MemRead, MemWrite, PCSrc, ALUSrc, MemtoReg, and a 3-bit ALU operation]


Multiple ALUs and Memory Units

[Figure: single cycle datapath, which uses two adders plus the ALU and separate instruction and data memories]


Single Cycle Datapath


What’s Wrong with Single Cycle?

° All instructions run at the speed of the slowest instruction.

° Adding a long instruction can hurt performance

• What if you wanted to include multiply?

° You cannot reuse any parts of the processor

• We have 3 different adders: one to calculate PC+4, one for PC+4+offset, and the ALU

° No profit in making the common case fast

• Since every instruction runs at the speed of the slowest instruction

- This is particularly important for loads, as we will see later


What’s Wrong with Single Cycle?

1 ns – Register read/write time

2 ns – ALU/adder

2 ns – memory access

0 ns – MUX, PC access, sign extend, ROM

add: 2ns + 1ns + 2ns + 1ns = 6 ns

beq: 2ns + 1ns + 2ns = 5 ns

sw: 2ns + 1ns + 2ns + 2ns = 7 ns

lw: 2ns + 1ns + 2ns + 2ns + 1ns = 8 ns

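Each sum lists, in order: get instruction, read registers, ALU operation, memory access, write register (steps an instruction does not use are omitted).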
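The per-instruction sums above can be reproduced from the unit delays. The following is a minimal Python sketch; the dictionary names and the mapping from each instruction to the units it uses are illustrative assumptions chosen to match the sums on the slide.

```python
# Unit delays in ns, taken from the slide above.
UNIT_DELAY_NS = {
    "instr_fetch": 2,  # memory access that fetches the instruction
    "reg_read": 1,
    "alu": 2,
    "data_mem": 2,
    "reg_write": 1,
}

# Units each instruction passes through (assumed mapping, matching the sums above).
UNITS_USED = {
    "add": ["instr_fetch", "reg_read", "alu", "reg_write"],
    "beq": ["instr_fetch", "reg_read", "alu"],
    "sw": ["instr_fetch", "reg_read", "alu", "data_mem"],
    "lw": ["instr_fetch", "reg_read", "alu", "data_mem", "reg_write"],
}


def latency_ns(instr):
    """Sum the delays of every unit the instruction uses."""
    return sum(UNIT_DELAY_NS[unit] for unit in UNITS_USED[instr])


LATENCY_NS = {name: latency_ns(name) for name in UNITS_USED}

# A single cycle design must size its clock for the slowest instruction.
SINGLE_CYCLE_CLOCK_NS = max(LATENCY_NS.values())
```

Under these delays the load sets the single cycle clock period, which is why every other instruction pays for time it does not need.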


Computing Execution Time

Assume: 100 instructions executed

25% of instructions are loads,

10% of instructions are stores,

45% of instructions are adds, and

20% of instructions are branches.

Single-cycle execution:

100 * 8ns = 800 ns

Optimal execution:

25*8ns + 10*7ns + 45*6ns + 20*5ns = 640 ns
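The two totals can be checked with a few lines of Python. This is only a sketch of the arithmetic on this slide; the names `MIX`, `LATENCY_NS`, and both function names are made up for illustration.

```python
# Instruction counts out of 100, from the assumptions above.
MIX = {"lw": 25, "sw": 10, "add": 45, "beq": 20}

# Per-instruction latencies in ns, from the previous slide.
LATENCY_NS = {"lw": 8, "sw": 7, "add": 6, "beq": 5}


def single_cycle_time_ns(mix, latency_ns):
    """Every instruction takes one clock, sized for the slowest instruction."""
    clock_ns = max(latency_ns.values())
    return sum(mix.values()) * clock_ns


def optimal_time_ns(mix, latency_ns):
    """Each instruction takes only as long as it actually needs."""
    return sum(count * latency_ns[name] for name, count in mix.items())


single = single_cycle_time_ns(MIX, LATENCY_NS)
optimal = optimal_time_ns(MIX, LATENCY_NS)
```

The gap between the two totals (800 ns versus 640 ns) is the time a single cycle design spends waiting on instructions that finished early.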


Single Cycle Problems

° A sequence of instructions:

1. LW (IF, ID, EX, MEM, WB)

2. SW (IF, ID, EX, MEM)

3. etc

[Timing diagram: single cycle implementation. The load fills Cycle 1; the store finishes before the end of Cycle 2 and the remainder of that cycle is wasted.]

• what if we had a more complicated instruction like floating point?

• wasteful of area


Multiple Cycle Solution

• use a “smaller” cycle time

• have different instructions take different numbers of cycles

• a “multicycle” datapath:

[Figure: multicycle datapath with the PC, a single memory for instructions and data, the instruction register, the memory data register, the register file, the A and B registers, the ALU, and ALUOut]


Multicycle Approach

° We will be reusing functional units

• ALU used to compute address and to increment PC

• Memory used for instruction and data

° We’ll use a finite state machine for control


The Five Stages of an Instruction

° IF: Instruction Fetch and Update PC

° ID: Instruction Decode and Registers Fetch

° Ex: Execute R-type; calculate memory address

° Mem: Read/write the data from/to the Data Memory

° WB: Write the result data into the register file

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

IF ID Ex Mem WB


Multicycle Implementation

° Break up the instructions into steps, each step takes a cycle

• balance the amount of work to be done

• restrict each cycle to use only one major functional unit

° At the end of a cycle

• store values for use in later cycles (easiest thing to do)

• introduce additional “internal” registers

[Figure: multicycle datapath with a single memory, the instruction register, the memory data register, the register file, the A and B registers, the ALU, and ALUOut. Multiplexers select the memory address, the register write inputs, and the ALU operands, including the constant 4 and the sign-extended, shifted Instruction[15–0].]



The Five Stages of Load Instruction

° IF: Instruction Fetch and Update PC

° ID: Instruction Decode and Registers Fetch

° Ex: Execute R-type; calculate memory address

° Mem: Read/write the data from/to the Data Memory

° WB: Write the result data into the register file


lw: IF ID Ex Mem WB


Multiple Cycle Implementation

° Break the instruction execution into Clock Cycles

• Different instructions require a different number of clock cycles

• Clock cycle is limited by the slowest stage

• Instruction latency is not reduced (time from the start of an instruction to its completion)

[Timing diagram: lw takes five cycles (IFetch, Dec, Exec, Mem, WB); the sw that follows takes four (IFetch, Dec, Exec, Mem).]


Single Cycle vs. Multiple Cycle

[Timing diagram, Cycles 1 to 10. Multiple cycle implementation: lw occupies five cycles (IFetch, Dec, Exec, Mem, WB), sw the next four (IFetch, Dec, Exec, Mem), and an R-type instruction then begins its IFetch. Single cycle implementation: the load fills Cycle 1; the store finishes early in Cycle 2 and the rest of that cycle is wasted.]





Instructions from ISA perspective

° Consider each instruction from perspective of ISA.

° Example:

• The add instruction changes a register.

• Register specified by bits 15:11 of instruction.

• Instruction specified by the PC.

• New value is the sum (“op”) of two registers.

• Registers specified by bits 25:21 and 20:16 of the instruction

Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

• In order to accomplish this we must break up the instruction (kind of like introducing variables when programming).


Breaking down an instruction

° ISA definition of arithmetic:

Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

° Could break down to:

• IR <= Memory[PC]

• A <= Reg[IR[25:21]]

• B <= Reg[IR[20:16]]

• ALUOut <= A op B

• Reg[IR[15:11]] <= ALUOut

° We forgot an important part of the definition of arithmetic!

•PC <= PC + 4
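The breakdown can be mimicked in Python, where each intermediate register becomes a local variable. This is an illustrative sketch only: memory is modeled as a dictionary from byte address to 32-bit word, "op" is fixed to addition, the destination is taken from bits 15:11 as in the ISA definition, and the helper `encode_add` is a hypothetical convenience for building the instruction word.

```python
MASK32 = 0xFFFFFFFF


def encode_add(rd, rs, rt):
    """Build a MIPS R-type add word: opcode 0, funct 0x20 (helper for this sketch)."""
    return (rs << 21) | (rt << 16) | (rd << 11) | 0x20


def run_add(memory, reg, pc):
    """Execute one add as the sequence of register transfers listed above."""
    ir = memory[pc]                   # IR <= Memory[PC]
    pc = (pc + 4) & MASK32            # PC <= PC + 4
    a = reg[(ir >> 21) & 0x1F]        # A <= Reg[IR[25:21]]
    b = reg[(ir >> 16) & 0x1F]        # B <= Reg[IR[20:16]]
    alu_out = (a + b) & MASK32        # ALUOut <= A op B (op assumed to be +)
    reg[(ir >> 11) & 0x1F] = alu_out  # Reg[IR[15:11]] <= ALUOut
    return pc


# Usage: add register 8 = register 9 + register 10
reg = [0] * 32
reg[9], reg[10] = 5, 7
memory = {0: encode_add(rd=8, rs=9, rt=10)}
new_pc = run_add(memory, reg, pc=0)
```

Each assignment corresponds to one register transfer, which is the sense in which the intermediate registers act like variables.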


Idea behind multicycle approach

° We define each instruction from the ISA perspective (do this!)

° Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps)

° Introduce new registers as needed (e.g., A, B, ALUOut, MDR, etc.)

° Finally, try to pack as much work as possible into each step (avoid unnecessary cycles) while also sharing steps between instructions where possible (minimizes control, helps to simplify the solution)


Five Execution Steps

° Instruction Fetch

° Instruction Decode and Register Fetch

° Execution, Memory Address Computation, or Branch Completion

° Memory Access or R-type instruction completion

° Write-back step

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
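To see what 3 to 5 cycles per instruction means for execution time, the sketch below computes the average CPI for the instruction mix used earlier in the lecture. The cycle counts follow from where each instruction finishes in the five steps (branch in step 3, store and R-type in step 4, load in step 5). The 2 ns clock is an assumption, taken from the slowest functional unit in the earlier delay table, and all names are illustrative.

```python
# Cycles per instruction, from the step in which each instruction completes.
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}

# Instruction counts out of 100, reused from the execution-time slide.
MIX = {"lw": 25, "sw": 10, "add": 45, "beq": 20}

# Assumed clock period: the slowest unit (ALU or memory) takes 2 ns.
CLOCK_NS = 2


def total_cycles(mix, cycles):
    """Total clock cycles needed to run the whole instruction mix."""
    return sum(count * cycles[name] for name, count in mix.items())


def average_cpi(mix, cycles):
    """Average cycles per instruction over the mix."""
    return total_cycles(mix, cycles) / sum(mix.values())


cpi = average_cpi(MIX, CYCLES)
time_ns = total_cycles(MIX, CYCLES) * CLOCK_NS
```

With these particular assumptions the total is 810 ns, so the benefit of the multicycle design depends heavily on how evenly the work is balanced across steps and on the instruction mix.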


Step 1: Instruction Fetch

° Use PC to get instruction and put it in the Instruction Register.

° Increment the PC by 4 and put the result back in the PC.

° Can be described succinctly using RTL ("Register-Transfer Language"):

IR <= Memory[PC];

PC <= PC + 4;

Can we figure out the values of the control signals?

What is the advantage of updating the PC now?


Step 2: Instruction Decode and Register Fetch

° Read registers rs and rt in case we need them

° Compute the branch address in case the instruction is a branch

° RTL:

A <= Reg[IR[25:21]];

B <= Reg[IR[20:16]];

ALUOut <= PC + (sign-extend(IR[15:0]) << 2);

° We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
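The field extraction and the speculative branch-target computation of this step can be sketched in Python. The function names are hypothetical, the register file is modeled as a plain list, and `pc` is assumed to already hold PC + 4 from step 1.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def decode_step(ir, pc, reg):
    """Return (A, B, ALUOut) as produced by the decode/register-fetch step."""
    a = reg[(ir >> 21) & 0x1F]                    # A <= Reg[IR[25:21]]
    b = reg[(ir >> 16) & 0x1F]                    # B <= Reg[IR[20:16]]
    offset = sign_extend16(ir) << 2               # word offset to byte offset
    alu_out = (pc + offset) & MASK32              # ALUOut <= PC + (offset << 2)
    return a, b, alu_out


# Usage: beq $1, $2, -1 with the PC already incremented to 8
reg = [0] * 32
reg[1], reg[2] = 10, 20
beq_word = (0x04 << 26) | (1 << 21) | (2 << 16) | 0xFFFF
a, b, target = decode_step(beq_word, pc=8, reg=reg)
```

The target is computed for every instruction, whether or not it turns out to be a branch; a non-branch simply never uses the value left in ALUOut.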


Step 3 (instruction dependent)

° ALU is performing one of three functions, based on instruction type

° Memory Reference:

ALUOut <= A + sign-extend(IR[15:0]);

° R-type:

ALUOut <= A op B;

° Branch:

if (A==B) PC <= ALUOut;


Step 4 (R-type or memory-access)

° Loads and stores access memory

MDR <= Memory[ALUOut];

or

Memory[ALUOut] <= B;

° R-type instructions finish

Reg[IR[15:11]] <= ALUOut;


Write-back step

° Reg[IR[20:16]] <= MDR;

Which instruction needs this?
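Putting the steps together, the following Python sketch runs one instruction at a time through the register transfers of steps 1 to 5 and reports how many cycles it used. It is a simplified model under stated assumptions, not the lecture's implementation: memory is a dictionary keyed by byte address, the only R-type operation modeled is addition, and only lw, sw, R-type, and beq are handled. The opcode constants are the standard MIPS encodings.

```python
LW, SW, RTYPE, BEQ = 0x23, 0x2B, 0x00, 0x04  # MIPS opcodes
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def execute(memory, reg, pc):
    """Run one instruction step by step; return (new_pc, cycles_used)."""
    # Step 1: instruction fetch
    ir = memory[pc]
    pc = (pc + 4) & MASK32
    # Step 2: decode, register fetch, speculative branch target
    op = (ir >> 26) & 0x3F
    rs, rt, rd = (ir >> 21) & 0x1F, (ir >> 16) & 0x1F, (ir >> 11) & 0x1F
    imm = sign_extend16(ir)
    a, b = reg[rs], reg[rt]
    alu_out = (pc + (imm << 2)) & MASK32
    # Step 3: instruction dependent
    if op == BEQ:
        if a == b:
            pc = alu_out                 # branch completes here
        return pc, 3
    if op == RTYPE:
        alu_out = (a + b) & MASK32       # only "add" is modeled in this sketch
        reg[rd] = alu_out                # Step 4: R-type completion
        return pc, 4
    if op in (LW, SW):
        alu_out = (a + imm) & MASK32     # memory address computation
        if op == SW:
            memory[alu_out] = b          # Step 4: store writes memory
            return pc, 4
        mdr = memory[alu_out]            # Step 4: load reads memory
        reg[rt] = mdr                    # Step 5: write-back
        return pc, 5
    raise ValueError(f"opcode {op:#x} is not modeled")


# Usage: lw $2,0($1); add $3,$2,$2; sw $3,4($1); beq $0,$0,-4
reg = [0] * 32
reg[1] = 100
memory = {
    0: (LW << 26) | (1 << 21) | (2 << 16) | 0,
    4: (2 << 21) | (2 << 16) | (3 << 11) | 0x20,
    8: (SW << 26) | (1 << 21) | (3 << 16) | 4,
    12: (BEQ << 26) | 0xFFFC,
    100: 42,
}
pc, total = 0, 0
for _ in range(4):
    pc, used = execute(memory, reg, pc)
    total += used
```

Only the load reaches the write-back step, which answers the question on the slide: lw is the instruction that needs it.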


Summary:


Multiple Cycle Implementation

[Figure: complete multicycle datapath with control. The control unit takes Op[5–0] and drives PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, and RegDst. An ALU control block takes Instruction[5–0]. The jump address [31–0] is formed from PC[31–28] and Instruction[25–0] shifted left 2 (26 bits to 28 bits).]


Review: finite state machines

° Finite state machines:

• a set of states and

• next state function (determined by current state and the input)

• output function (determined by current state and possibly input)

• We’ll use a Moore machine (output based only on current state)

[Figure: the current state and the inputs feed the next-state function, whose result is loaded into the state register on the clock; the output function produces the outputs]


Implementing the Control

° Value of control signals is dependent upon:

• what instruction is being executed

• which step is being performed

° Use the information we’ve accumulated to specify a finite state machine

• specify the finite state machine graphically, or

• use microprogramming

° Implementation can be derived from specification


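Graphical Specification of FSM

[State diagram, listed by state number. Control values that are not legible in the source figure are marked as such.]

State 0, Instruction fetch (Start): MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00

State 1, Instruction decode/register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00

State 2, Memory address computation: (values not legible)

State 3, Memory access (load): MemRead, IorD = 1

State 4, Memory read completion step: RegDst = 0, RegWrite, MemtoReg = 1

State 5, Memory access (store): MemWrite, IorD = 1

State 6, Execution: (values not legible)

State 7, R-type completion: RegDst = 1, RegWrite, MemtoReg = 0

State 8, Branch completion: PCWriteCond, PCSource = 01 (ALU settings not legible)

State 9, Jump completion: PCWrite, PCSource = 10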
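The state diagram can be written down as a small Moore machine: a table of outputs per state plus a next-state function of the current state and the opcode. The sketch below is an assumption-laden illustration, not the lecture's implementation. The state numbering (0 fetch, 1 decode, 2 address computation, 3 load access, 4 load write-back, 5 store access, 6 execution, 7 R-type completion, 8 branch completion, 9 jump completion) and the transitions follow the usual textbook version of this diagram. The ALU settings for states 2, 6, and 8 are likewise taken from that textbook version rather than from the figure, and state 4 uses RegDst = 0 and MemtoReg = 1, which is what writing MDR into Reg[IR[20:16]] requires.

```python
LW, SW, RTYPE, BEQ, J = 0x23, 0x2B, 0x00, 0x04, 0x02  # MIPS opcodes

# Moore outputs: asserted control signals depend only on the current state.
OUTPUTS = {
    0: {"MemRead": 1, "ALUSrcA": 0, "IorD": 0, "IRWrite": 1,
        "ALUSrcB": 0b01, "ALUOp": 0b00, "PCWrite": 1, "PCSource": 0b00},
    1: {"ALUSrcA": 0, "ALUSrcB": 0b11, "ALUOp": 0b00},
    2: {"ALUSrcA": 1, "ALUSrcB": 0b10, "ALUOp": 0b00},   # assumed values
    3: {"MemRead": 1, "IorD": 1},
    4: {"RegDst": 0, "RegWrite": 1, "MemtoReg": 1},
    5: {"MemWrite": 1, "IorD": 1},
    6: {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b10},   # assumed values
    7: {"RegDst": 1, "RegWrite": 1, "MemtoReg": 0},
    8: {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b01,    # ALU part assumed
        "PCWriteCond": 1, "PCSource": 0b01},
    9: {"PCWrite": 1, "PCSource": 0b10},
}


def next_state(state, opcode):
    """Next-state function: depends on the current state and the opcode input."""
    if state == 0:
        return 1
    if state == 1:
        return {LW: 2, SW: 2, RTYPE: 6, BEQ: 8, J: 9}[opcode]
    if state == 2:
        return 3 if opcode == LW else 5
    if state == 3:
        return 4
    if state == 6:
        return 7
    return 0  # states 4, 5, 7, 8, 9 finish the instruction


def trace(opcode):
    """List the states one instruction visits, starting at fetch."""
    states = [0]
    while True:
        following = next_state(states[-1], opcode)
        if following == 0:
            return states
        states.append(following)
```

For example, `trace(LW)` visits five states and `trace(BEQ)` visits three, matching the 3 to 5 cycles per instruction stated earlier.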


Finite State Machine for Control

° Implementation:

[Figure: control logic block. Inputs: opcode bits Op5–Op0 from the instruction register opcode field and current-state bits S3–S0 from the state register. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and the next-state bits NS3–NS0, which feed the state register.]
