Full Multicycle Datapath

Feb 01, 2022

dariahiddleston
Page 1: Full Multicycle Datapath

Full Multicycle Datapath

[Figure: full multicycle datapath. Labeled blocks: PC, Memory (ADDR, RD, WD), IR, MDR, Registers (RN1, RN2, WN, WD, RD1, RD2), sign extend (16 to 32 bits), two shift-left-by-2 units, jump-address concatenation (28 to 32 bits, from I[25:0]), registers A and B, ALU (with Zero output), and ALUOut. Instruction fields: rs, rt, rd, immediate. Control signals: PCWr*, IorD, MemRead, MemWrite, IRWrite, RegDst, MemtoReg, RegWrite, ALUSrcA, ALUSrcB, Operation, PCSource.]

Page 2: Full Multicycle Datapath

Our new datapath

• We eliminate both extra adders in a multicycle datapath, and instead use just one ALU, with multiplexers to select the proper inputs.
• A 2-to-1 mux ALUSrcA sets the first ALU input to be the PC or a register.
• A 4-to-1 mux ALUSrcB selects the second ALU input from among:
– the register file (for arithmetic operations),
– a constant 4 (to increment the PC),
– a sign-extended constant (for effective addresses), and
– a sign-extended and shifted constant (for branch targets).
• This permits a single ALU to perform all of the necessary functions.
– Arithmetic operations on two register operands.
– Incrementing the PC.
– Computing effective addresses for lw and sw.
– Adding a sign-extended, shifted offset to (PC + 4) for branches.
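
The two ALU input multiplexers can be modeled in a few lines. This is a minimal sketch, not the slide author's implementation: the select encodings (0 picks the PC on ALUSrcA; 0 to 3 pick the register, the constant 4, the sign-extended constant, and the shifted constant on ALUSrcB) are assumed to follow the order in which the slide lists the inputs, and the function names are hypothetical.

```python
# Sketch of the two ALU input multiplexers in the multicycle datapath.
# The select encodings below are assumptions made for illustration.

def sign_extend16(imm16: int) -> int:
    """Sign-extend a 16-bit field to a Python int."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def alu_inputs(alu_src_a: int, alu_src_b: int, pc: int, a: int, b: int, imm16: int):
    """Return the (first, second) ALU operands chosen by the two muxes."""
    first = pc if alu_src_a == 0 else a           # 2-to-1 mux: PC or register A
    ext = sign_extend16(imm16)
    second = (b, 4, ext, ext << 2)[alu_src_b]     # 4-to-1 mux
    return first, second

# Incrementing the PC: first input is the PC, second input is the constant 4.
x, y = alu_inputs(0, 1, pc=0x1000, a=0, b=0, imm16=0)
print(hex(x + y))  # 0x1004
```

With these encodings, an effective address for lw or sw would use `alu_inputs(1, 2, ...)`, and a branch target would use `alu_inputs(0, 3, ...)` once the PC already holds PC + 4.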

Page 3: Full Multicycle Datapath

Full Multicycle Implementation

Page 4: Full Multicycle Datapath

Historical Perspective

• In the ’60s and ’70s microprogramming was very important for implementing machines
• This led to more sophisticated ISAs and the VAX
• In the ’80s RISC processors based on pipelining became popular
• Pipelining the microinstructions is also possible!
• Implementations of IA-32 architecture processors since the 486 use:
– “hardwired control” for simpler instructions (few cycles, FSM control implemented using PLA)
– “microcoded control” for more complex instructions (large numbers of cycles, central control store)
• The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store

Page 5: Full Multicycle Datapath

Pentium 4

• Pipelining is important (last IA-32 without it was 80386 in 1985)
• Pipelining is used for the simple instructions favored by compilers

“Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions”

[Figure: Pentium 4 block diagram. Labeled regions: Control (four regions), Enhanced floating point and multimedia, I/O interface, Instruction cache, Integer datapath, Data cache, Secondary cache and memory interface, Advanced pipelining hyperthreading support. Annotations: Chapters 3 and 4; Chapter 5.]

Page 6: Full Multicycle Datapath

Pentium 4

• Somewhere in all that “control” we must handle complex instructions
• Processor executes simple microinstructions, 70 bits wide (hardwired)
• 120 control lines for integer datapath (400 for floating point)
• If an instruction requires more than 4 microinstructions to implement, control comes from the microcode ROM (8000 microinstructions)
• It’s complicated!

[Figure: Pentium 4 block diagram, repeated from the previous slide.]

Page 7: Full Multicycle Datapath

Summary (1)

• A single-cycle CPU has two main disadvantages.
– The cycle time is limited by the worst case latency.
– It requires more hardware than necessary.
• A multicycle processor splits instruction execution into several stages.
– Instructions only execute as many stages as required.
– Each stage is relatively simple, so the clock cycle time is reduced.
– Functional units can be reused on different cycles.
• We made several modifications to the single-cycle datapath.
– The two extra adders and one memory were removed.
– Multiplexers were inserted so the ALU and memory can be used for different purposes in different execution stages.
– New registers are needed to store intermediate results.

Page 8: Full Multicycle Datapath

Summary (2)

• If we understand the instructions… we can build a simple processor!
• If instructions take different amounts of time, multi-cycle is better
• Datapath implemented using:
– Combinational logic for arithmetic
– State holding elements to remember bits
• Control implemented using:
– Combinational logic for single-cycle implementation
– Finite state machine for multi-cycle implementation
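
To make the finite state machine idea concrete, the sketch below lists the states one instruction might pass through, one state per clock cycle. The state names and the five instruction classes are assumptions modeled on the textbook-style multicycle MIPS control; the slides themselves do not spell out the state diagram.

```python
# Sketch of a finite state machine for multicycle control.
# State names and instruction classes are illustrative assumptions.

STATES_AFTER_DECODE = {
    "lw": ["MEM_ADDR", "MEM_READ", "WRITE_BACK"],
    "sw": ["MEM_ADDR", "MEM_WRITE"],
    "rtype": ["EXECUTE", "RTYPE_COMPLETE"],
    "beq": ["BRANCH_COMPLETE"],
    "j": ["JUMP_COMPLETE"],
}

def state_sequence(op: str) -> list[str]:
    """States visited by one instruction; each state takes one clock cycle."""
    # Every instruction shares the fetch and decode states.
    return ["FETCH", "DECODE"] + STATES_AFTER_DECODE[op]

for op in STATES_AFTER_DECODE:
    seq = state_sequence(op)
    print(f"{op:5s} {len(seq)} cycles: {' -> '.join(seq)}")
```

Under these assumptions a load takes five cycles while a branch or jump takes three, which is the sense in which instructions only execute as many stages as they require.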

Page 9: Full Multicycle Datapath

Pipeline: Introduction

These slides are adapted from notes by Dr. David Patterson (UCB)

Page 10: Full Multicycle Datapath

What is Pipelining?

• A way of speeding up execution of instructions
• Key idea: overlap execution of multiple instructions

Page 11: Full Multicycle Datapath

The Laundry Analogy

• Anna, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 30 minutes
• “Folder” takes 30 minutes
• “Stasher” takes 30 minutes to put clothes into drawers

[Figure: four laundry loads labeled A, B, C, D.]

Page 12: Full Multicycle Datapath

If we do laundry sequentially...

[Figure: timeline from 6 PM to 2 AM, with task order on the vertical axis; loads A, B, C, and D run one after another, each taking four 30-minute steps.]

• Time Required: 8 hours for 4 loads

Page 13: Full Multicycle Datapath

To Pipeline, We Overlap Tasks

[Figure: timeline from 6 PM to 2 AM, with task order on the vertical axis; loads A, B, C, and D overlap, each divided into 30-minute steps.]

• Time Required: 3.5 Hours for 4 Loads
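
The 8-hour and 3.5-hour figures follow from simple arithmetic, sketched below. The function names are hypothetical; the sketch assumes four equal 30-minute steps per load and that a new load can start as soon as the washer is free.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load finishes all of its steps before the next load starts."""
    return loads * stages * stage_minutes

def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """The first load takes `stages` steps; each later load finishes one step after it."""
    return (stages + loads - 1) * stage_minutes

print(sequential_minutes(4, 4, 30) / 60)  # 8.0
print(pipelined_minutes(4, 4, 30) / 60)   # 3.5
```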

Page 14: Full Multicycle Datapath

To Pipeline, We Overlap Tasks

[Figure: the overlapped laundry timeline, 6 PM to 2 AM, loads A, B, C, and D in task order.]

• Does pipelining help the latency of a single task? No
• Does pipelining help the throughput of the entire workload? Yes
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
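
These answers can be checked numerically. The sketch below is an illustration under stated assumptions: `pipeline_speedup` is a hypothetical helper, the clock is set by the slowest stage, and the unbalanced stage times are made-up values with the same total work as the balanced case.

```python
def pipeline_speedup(stage_times, n_tasks):
    """Speedup of a pipeline over running the same tasks one at a time."""
    unpipelined = n_tasks * sum(stage_times)
    cycle = max(stage_times)                       # slowest stage sets the rate
    pipelined = (len(stage_times) + n_tasks - 1) * cycle  # includes fill and drain
    return unpipelined / pipelined

balanced = [30, 30, 30, 30]
unbalanced = [20, 40, 30, 30]   # hypothetical: same total, uneven stages

for n in (4, 1000):
    print(n, round(pipeline_speedup(balanced, n), 2),
          round(pipeline_speedup(unbalanced, n), 2))
```

With only four tasks, fill and drain time keeps the speedup well below 4. With many tasks, the balanced pipeline approaches the number of stages, while the unbalanced one stays lower because every stage waits on the 40-minute step.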

Page 15: Full Multicycle Datapath

Pipelining a Digital System

• Key idea: break big computation up into pieces
• Separate each piece with a pipeline register

[Figure: a 1ns computation split into five 200ps stages, with a pipeline register between stages.]

1 nanosecond = 10^-9 second
1 picosecond = 10^-12 second

Page 16: Full Multicycle Datapath

Pipelining a Digital System

• Why do this? Because it's faster for repeated computations

[Figure: a single 1ns block compared with five 200ps stages.]

Non-pipelined: 1 operation finishes every 1ns
Pipelined: 1 operation finishes every 200ps

Page 17: Full Multicycle Datapath

Comments about pipelining

• Pipelining increases throughput, but does not reduce latency
– Answer available every 200ps, BUT
– A single computation still takes 1ns
• Limitations:
– Computations must be divisible into stage size
– Pipeline registers add overhead
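
The effect of pipeline register overhead can be sketched with the slide's numbers. The 20ps register delay below is a hypothetical value chosen only for illustration, and the helper assumes the computation divides evenly into stages.

```python
def pipelined_timing(total_ps: int, stages: int, reg_overhead_ps: int = 0):
    """Return (cycle time, latency of one computation) in picoseconds."""
    cycle = total_ps // stages + reg_overhead_ps   # each stage also pays the register delay
    latency = cycle * stages                       # one computation passes through every stage
    return cycle, latency

print(pipelined_timing(1000, 5))      # (200, 1000)
print(pipelined_timing(1000, 5, 20))  # (220, 1100)
```

With overhead included, a result is available every 220ps rather than every 200ps, and a single computation takes slightly longer than the original 1ns.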

Page 18: Full Multicycle Datapath

Pipelining a Processor

• Recall the 5 steps in instruction execution:
1. Instruction Fetch (IF)
2. Instruction Decode and Register Read (ID)
3. Execute operation or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)
• Review: Single-Cycle Processor
– All 5 steps done in a single clock cycle
– Dedicated hardware required for each step

Page 19: Full Multicycle Datapath

Review - Single-Cycle Processor

What do we need to add to actually split the datapath into stages?

Page 20: Full Multicycle Datapath

The Basic Pipeline For MIPS

[Figure: pipeline diagram with instruction order on the vertical axis and Cycle 1 to Cycle 7 on the horizontal axis; four instructions, each passing through Ifetch, Reg, ALU, DMem, Reg.]
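
A pipeline diagram of this kind can be generated programmatically. The sketch below is an illustration only: it uses the stage abbreviations IF, ID, EX, MEM, WB, and it assumes an ideal pipeline in which each instruction starts one cycle after the previous one, with no stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_chart(n_instructions: int) -> list[list[str]]:
    """Row i lists the stage occupied by instruction i in each cycle ('' if none)."""
    n_cycles = len(STAGES) + n_instructions - 1
    chart = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name   # instruction i is in stage s during cycle i + s (0-based)
        chart.append(row)
    return chart

for i, row in enumerate(pipeline_chart(4)):
    print(f"instr {i + 1}: " + " ".join(f"{cell:>3}" for cell in row))
```

Reading down any column shows which stages are busy in that cycle; once the pipeline is full, every stage is working on a different instruction.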

Page 21: Full Multicycle Datapath

Basic Pipelined Processor

Page 22: Full Multicycle Datapath

Pipeline example: lw IF

Page 23: Full Multicycle Datapath

Pipeline example: lw ID

Page 24: Full Multicycle Datapath

Pipeline example: lw EX

Page 25: Full Multicycle Datapath

Pipeline example: lw MEM

Page 26: Full Multicycle Datapath

Pipeline example: lw WB

Can you find a problem?

Page 27: Full Multicycle Datapath

Basic Pipelined Processor (Corrected)

Page 28: Full Multicycle Datapath

Single-Cycle vs. Pipelined Execution

[Figure: two timing diagrams for the sequence lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), with time in ps on the horizontal axis and instruction order on the vertical axis. Each instruction goes through Instruction Fetch, REG RD, ALU, MEM, REG WR. Non-pipelined (0 to 1800 ps): a new instruction starts every 800ps. Pipelined (0 to 1600 ps): a new instruction starts every 200ps, and each stage takes 200ps.]
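
The totals implied by the two diagrams can be worked out as follows. This is a sketch using the 800ps instruction time and 200ps stage time shown in the figure; the function names are hypothetical, and the pipelined case assumes five stages and no stalls.

```python
def non_pipelined_ps(n_instr: int, instr_ps: int = 800) -> int:
    """Single-cycle execution: each instruction takes one full 800ps cycle."""
    return n_instr * instr_ps

def pipelined_ps(n_instr: int, stages: int = 5, stage_ps: int = 200) -> int:
    """Pipelined execution: fill the pipeline once, then finish one instruction per cycle."""
    return (stages + n_instr - 1) * stage_ps

print(non_pipelined_ps(3))  # 2400
print(pipelined_ps(3))      # 1400
print(pipelined_ps(1))      # 1000
```

A single lw takes longer in the pipeline (1000ps against 800ps) because every stage is stretched to the 200ps clock, yet three instructions already finish sooner, and for long instruction sequences the ratio approaches 800/200 = 4.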
