Lecture 6: Superscalar Decode and Other Pipelining.

Advanced Microarchitecture

2

RISC ISA Format
• This should be review…
– Fixed-length
  • MIPS: all insts are 32 bits (4 bytes)
– Few formats
  • MIPS has 3: R-, I-, J- formats
  • Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP
– Regularity across formats (when possible/practical)
  • MIPS, Alpha: opcode in same bit-position for all formats
  • MIPS: rs & rt fields in same bit-position for R- and I-formats
  • Alpha: ra/fa field in same bit-position for all 5 formats


3

RISC Decode (MIPS)

Instruction layout: opcode (6 bits) | other (20 bits) | func (6 bits, R-format only)

Opcode map (rows: opcode[5,3], columns: opcode[2,0])

       000    001    010    011    100    101    110    111
000    func   rt     j      jal    beq    bne    blez   bgtz
001    addi   addiu  slti   sltiu  andi   ori    xori   lui
010    rs     rs     rs     rs
011
100    lb     lh     lwl    lw     lbu    lhu    lwr
101    sb     sh     swl    sw                   swr
110    lwc0   lwc1   lwc2   lwc3
111    swc0   swc1   swc2   swc3

001xxx = Immediate

1xxxxx = Memory (1x0: LD, 1x1: ST)

000xxx = Br/Jump (except for 000000)
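The opcode-pattern rules above can be sketched in a few lines of Python. This is only an illustration of how little logic a first-level RISC decode needs; the function name `classify_mips` and the class labels are made up for this sketch.

```python
def classify_mips(inst: int) -> str:
    """Classify a 32-bit MIPS instruction word using only its 6-bit opcode."""
    opcode = (inst >> 26) & 0x3F   # opcode sits in the top 6 bits of every format
    hi = opcode >> 3               # opcode[5,3], the row of the opcode map
    if opcode == 0b000000:
        return "r-format"          # decode continues on the func field
    if hi == 0b000:
        return "branch/jump"       # 000xxx, except 000000
    if hi == 0b001:
        return "immediate"         # 001xxx
    if hi & 0b100:
        # 1xxxxx is memory: 1x0 rows are loads, 1x1 rows are stores
        return "store" if hi & 0b001 else "load"
    return "other"                 # e.g. the 010xxx row


print(classify_mips(0x8C000000))   # opcode 100011 (lw)
print(classify_mips(0xAC000000))   # opcode 101011 (sw)
```

Because the opcode is always in the same bit position, the classification never has to look at the rest of the word first.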

6

Superscalar Decode for RISC ISAs
• To sustain X instructions per cycle, must decode X instructions per cycle
– Just duplicate the hardware

[Figure: scalar: 1-Fetch → one 32-bit inst → Decoder → decoded inst. Superscalar: 4-wide superscalar fetch → four 32-bit insts → four Decoders → four decoded insts.]

7

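VLIW/EPIC ISAs
• Compiler finds the parallelism, packs multiple instructions into a “very long” instruction

[Figure: RISC: six separate instructions (Add, Load, Branch, Sub, Store, Xor). VLIW (EPIC-like): two long instructions, "IMB | Add | Load | Branch" and "IMI | Sub | Store | Xor", where IMB and IMI are the “template” bits.]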

8

VLIW Decoder
• Similar to superscalar RISC decoder

[Figure: the tmplt field feeds a Template Decoder; inst0, inst1 and inst2 each feed their own Decoder, producing three decoded insts.]

9

CISC ISA
• RISC focuses on fast access to information
– easy decode, I$, large RFs, D$
• CISCs are older
– designed in an era with fewer transistors, chips
– each memory access very expensive
  • pack as much work into as few bytes as possible
• more “expressive” instructions
– compare to simple RISC insts
– better potential code generation
– more complex code generation in practice


10

Example: VAX
• Superset of ISAs, incl. IBM360, DEC PDP-11
• VAX = “Virtual Address Extension”
• 16 32-bit registers
• 32-bit memory addressing
• Encoding:
– 1-2 byte opcode, followed by 0-6 operand specifiers, each of which may be up to 5 bytes
– Opcode implies datatype, size, # operands
– Orthogonality: any opcode with any addressing mode


11

VAX operand addressing

Mode               Example            Meaning
Register           Add R4, R3         R4 = R4 + R3
Immediate          Add R4, #3         R4 = R4 + 3
Displacement       Add R4, 100(R1)    R4 = R4 + Mem[100+R1]
Register Indirect  Add R4, (R1)       R4 = R4 + Mem[R1]
Indexed/Base       Add R3, (R1+R2)    R3 = R3 + Mem[R1+R2]
Direct/Absolute    Add R1, (1234)     R1 = R1 + Mem[1234]
Memory Indirect    Add R1, @(R3)      R1 = R1 + Mem[Mem[R3]]
Auto-Increment     Add R1, (R2)+      R1 = R1 + Mem[R2]; R2++
Auto-Decrement     Add R1, -(R2)      R2--; R1 = R1 + Mem[R2]

Any mode could be applied to any instruction!
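To make the table concrete, here is a minimal Python sketch of a source-operand fetch for each mode. The function `read_operand`, the mode names, and the dictionary-based register file and memory are all assumptions for illustration. Auto-increment and auto-decrement step by 1, following the "R2++" shorthand in the table rather than a real operand size.

```python
def read_operand(mode, regs, mem, reg=None, reg2=None, const=0):
    """Return the source operand value for one addressing mode from the table."""
    if mode == "register":
        return regs[reg]
    if mode == "immediate":
        return const
    if mode == "displacement":
        return mem[const + regs[reg]]
    if mode == "register_indirect":
        return mem[regs[reg]]
    if mode == "indexed":
        return mem[regs[reg] + regs[reg2]]
    if mode == "absolute":
        return mem[const]
    if mode == "memory_indirect":
        return mem[mem[regs[reg]]]      # two dependent memory accesses
    if mode == "auto_increment":
        value = mem[regs[reg]]
        regs[reg] += 1                  # side effect after the access
        return value
    if mode == "auto_decrement":
        regs[reg] -= 1                  # side effect before the access
        return mem[regs[reg]]
    raise ValueError(f"unknown mode: {mode}")


# Made-up register and memory contents for the example.
regs = {"R1": 100, "R2": 8, "R3": 5, "R4": 10}
mem = {100: 7, 108: 20, 200: 9, 7: 42}

# Add R4, 100(R1): R4 = R4 + Mem[100 + R1]
regs["R4"] += read_operand("displacement", regs, mem, reg="R1", const=100)
```

Since any opcode can pair with any mode, a decoder has to handle this whole case split for every operand specifier, which is part of why VAX decode is expensive.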

12

x86
• CISC, stemming from the original 4004
• Example: “Move” instructions
1. General Purpose data movement
– RR, MR, RM, IR, IM
2. Exchanges
– EAX ↔ ECX, byte order within a register
3. Stack Manipulation
– push, pop: R ↔ Stack, PUSHA/POPA
4. Type Conversion
5. Conditional Moves

Many ways to do the same/similar operation

13

x86 Encoding
• Basic x86 Instruction:

Field          Size
Prefixes       0-4 bytes
Opcode         1-2 bytes
Mod R/M        0-1 bytes
SIB            0-1 bytes
Displacement   0/1/2/4 bytes
Immediate      0/1/2/4 bytes

Longest inst: 15 bytes
Shortest inst: 1 byte

• Opcode specifies the operation, and whether the Mod R/M byte is used
– Most instructions use the Mod R/M byte
– Mod R/M specifies if the optional SIB byte is used
– Mod R/M and SIB may specify additional constants

14

Mod R/M Byte

Byte layout: M M r r r m m m
– MM: Mode
– rrr: Register (1 of 8 registers)
– mmm: R/M

• Mode = 00: No-displacement, use Mem[ reg_mmm ]
• Mode = 01: 8-bit displacement, Mem[ reg_mmm + SExt(disp) ]
• Mode = 10: 32-bit displacement (similar to previous)
• Mode = 11: Register-to-Register, use reg_mmm

Add EBX, ECX      Mod R/M = 11 011 001
Add EBX, [ECX]    Mod R/M = 00 011 001
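The field split is just shifts and masks. The sketch below reproduces the two example encodings; the helper names `split_modrm` and `describe` are made up here, and `describe` deliberately ignores the R/M = 4 and R/M = 5 special cases.

```python
# Standard 32-bit register numbering for the 3-bit fields.
REG32 = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


def split_modrm(byte: int):
    """Split a Mod R/M byte into its (mode, register, r/m) fields."""
    mode = (byte >> 6) & 0b11
    reg = (byte >> 3) & 0b111
    rm = byte & 0b111
    return mode, reg, rm


def describe(byte: int) -> str:
    """Render the operands in the slide's notation (special cases not handled)."""
    mode, reg, rm = split_modrm(byte)
    if mode == 0b11:
        return f"{REG32[reg]}, {REG32[rm]}"
    if mode == 0b00:
        return f"{REG32[reg]}, [{REG32[rm]}]"
    disp = "disp8" if mode == 0b01 else "disp32"
    return f"{REG32[reg]}, [{REG32[rm]} + {disp}]"


print(describe(0b11011001))   # EBX, ECX
print(describe(0b00011001))   # EBX, [ECX]
```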

15

Exceptions
• Mod=00, R/M = 5 → get operand from 32-bit immediate
– Add EDX = EDX + Mem[0cff1234]
– Encoding: Mod R/M = 00 010 101, followed by 0cff1234
• Mod=00, 01 or 10, R/M = 4 → use the “SIB” byte
– SIB = Scale/Index/Base
– Encoding: Mod R/M = 00 010 100, followed by SIB = ss iii bbb
– bbb ≠ 5: use reg_bbb
– bbb = 5: use 32-bit imm (Mod = 00 only)
– iii ≠ 4: use si
– iii = 4: use 0
– si = reg_iii << ss
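Putting the Mod R/M modes and these exceptions together, the address calculation might be sketched as below. The function `effective_address` and its argument conventions are assumptions for illustration: `regs` is a list of eight register values indexed by register number, and `disp` is whatever constant follows the instruction, already sign-extended (0 if none).

```python
def effective_address(mod, rm, regs, sib=None, disp=0):
    """Compute a 32-bit memory address from Mod R/M fields and an optional SIB byte."""
    if mod == 0b11:
        raise ValueError("Mod=11 is register-to-register; there is no memory address")
    if rm == 4:
        # R/M = 4: a SIB byte follows (ss iii bbb)
        ss, iii, bbb = (sib >> 6) & 0b11, (sib >> 3) & 0b111, sib & 0b111
        si = 0 if iii == 4 else regs[iii] << ss   # scaled index, or 0 when iii = 4
        # bbb = 5 with Mod = 00: no base register, the 32-bit constant (disp) is used
        base = 0 if (bbb == 5 and mod == 0b00) else regs[bbb]
        return (base + si + disp) & 0xFFFFFFFF
    if mod == 0b00 and rm == 5:
        return disp & 0xFFFFFFFF                  # 32-bit constant only
    return (regs[rm] + disp) & 0xFFFFFFFF


# Made-up register values: EAX = 0x10, ECX = 0x40, EBX = 0x1000.
regs = [0x10, 0x40, 0, 0x1000, 0, 0, 0, 0]
addr = effective_address(0b10, 4, regs, sib=0b11000011, disp=0x20)
print(hex(addr))   # (EAX << 3) + EBX + 0x20
```

Note how the decoder cannot know whether a SIB byte or a displacement follows until it has examined the Mod R/M byte, so the fields must be parsed in sequence.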

16

Opcode Confusion
• There are different opcodes for A←B and B←A

10001011 11000011    MOV EAX, EBX
10001001 11000011    MOV EBX, EAX
10001001 11011000    MOV EAX, EBX

• If Opcode = 0F, then use next byte as opcode
• If Opcode = D8-DF, then FP instruction

[Figure: FP instruction example: first byte 11011000; the second byte has Mod = 11 and carries the FP opcode and R/M fields.]

17

x86 Decode Example

Byte       Field     Meaning
11000111   opcode    MOV reg←imm (use Mod R/M, 32-bit Imm to follow)
10000100   Mod R/M   Mod=2 (use 32-bit Disp), R/M = 4 (use SIB), reg ignored
11000011   SIB       ss=3 → Scale by 8, use EAX, EBX
(4 bytes)  Disp
(4 bytes)  Imm

*( (EAX<<3) + EBX + Disp ) = Imm
Total: 11 byte instruction

Note: Add 4 prefixes, and you reach the max size

18

In RISC (MIPS)

1. lui R1 = Disp[31:16]
2. ori R1 = R1, Disp[15:0]
3. add R1 = R1 + R2
4. shli R3 = R3 << 3
5. add R3 = R3 + R1
6. lui R1 = Imm[31:16]
7. ori R1 = R1, Imm[15:0]
8. st [R3] R1

8 instructions, 32 bits each
32 bytes total

2.9x Bigger!

19

x86-64 / EM64T
• 8→16 general purpose registers
– only 3-bit register fields?
• Registers extended from 32→64 bits each
• Default: instructions still 32-bit
– New “REX” prefix byte to specify additional information

REX prefix: 0100 m R I B, followed by the opcode
– m=0: 32-bit mode (the default), m=1: 64-bit mode
– R extends the Mod R/M register field: md rrr mrm → R rrr
– I and B extend the SIB index and base fields: ss iii bbb → I iii, B bbb

Register specifiers are now 4 bits each: can choose 1 of 16 registers
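A small Python sketch of how the REX bits widen the 3-bit fields to 4 bits. The bit names m, R, I, B follow the slide; the functions `rex_fields` and `extended_regs` are made up for illustration. When no SIB byte is present, the B bit extends the Mod R/M r/m field instead of the SIB base.

```python
def rex_fields(rex: int) -> dict:
    """Split a REX prefix byte (0100 m R I B) into its four flag bits."""
    if (rex >> 4) != 0b0100:
        raise ValueError("not a REX prefix")
    return {"m": (rex >> 3) & 1, "R": (rex >> 2) & 1, "I": (rex >> 1) & 1, "B": rex & 1}


def extended_regs(rex: int, modrm: int, sib=None) -> dict:
    """Return the 4-bit register specifiers formed by REX plus Mod R/M (and SIB)."""
    f = rex_fields(rex)
    out = {"reg": (f["R"] << 3) | ((modrm >> 3) & 0b111)}
    if sib is None:
        out["rm"] = (f["B"] << 3) | (modrm & 0b111)
    else:
        out["index"] = (f["I"] << 3) | ((sib >> 3) & 0b111)
        out["base"] = (f["B"] << 3) | (sib & 0b111)
    return out


# REX = 0100 1100 (m=1, R=1), Mod R/M = 11 011 001
print(extended_regs(0x4C, 0b11011001))
```

The extra bit is simply prepended to each existing field, which is why the old 3-bit encodings keep working unchanged when no REX prefix is present.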

20

64-bit Extensions to IA32

[Figure: cartoon with the labels "IA32", "IA32 + 64-bit exts" and "CPU architect". (Taken from Bob Colwell’s Eckert-Mauchly Award Talk, ISCA 2005)]

Ugly? Scary?… but it works

21

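x86 Decode Hardware

[Figure: block diagram. Instruction bytes enter a Prefix Decoder, whose Num Prefixes output controls a Left Shift. The remaining blocks are the opcode decoder, 2nd opcode decoder, Mod R/M decoder and SIB decoder, connected through two more Left Shift blocks and two adders (+).]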

22

Decoded x86 Format
• RISC: easy to expand → union of needed info
– generalized opcode (not too hard)
– reg1, reg2, reg3, immediate (possibly extended)
– some fields ignored
• CISC: union of all possible info is huge
– generalized opcode (too many options)
– up to 3 regs, 2 immediates
– segment information
– “rep” specifiers
  • would lead to 100’s of bits
  • common case only needs a fraction → a lot of waste


23

x86 RISC-like uops
• Each x86 instruction decoded into a variable number of “uops” (micro-ops - Intel) or ROPs (RISC ops - AMD)
– Each uop is RISC-like
– Uops have limitations to keep union of info practical

x86 instruction    uops                                                      Count
ADD EAX, EBX       ADD EAX, EBX                                              1 uop
ADD EAX, [EBX]     Load tmp = [EBX]; ADD EAX, tmp                            2 uops
ADD [EAX], EBX     Load tmp = [EAX]; ADD tmp, EBX; STA EAX; STD tmp          4 uops
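The three cases in the table can be written as a tiny cracking function. This is only a sketch: `crack` is a made-up name, operands are plain strings such as "EAX" or "[EBX]", and it covers only the ADD forms shown above.

```python
def crack(dst: str, src: str) -> list:
    """Crack 'ADD dst, src' into RISC-like uops, following the table above."""
    dst_mem = dst.startswith("[")
    src_mem = src.startswith("[")
    if dst_mem and src_mem:
        raise ValueError("ADD cannot take two memory operands")
    if not dst_mem and not src_mem:
        return [f"ADD {dst}, {src}"]                      # register-register: 1 uop
    if src_mem:
        return [f"Load tmp = {src}", f"ADD {dst}, tmp"]   # load-op: 2 uops
    addr = dst.strip("[]")
    # read-modify-write: load, add, store-address, store-data
    return [f"Load tmp = {dst}", f"ADD tmp, {src}", f"STA {addr}", "STD tmp"]


for operands in [("EAX", "EBX"), ("EAX", "[EBX]"), ("[EAX]", "EBX")]:
    print(operands, "->", crack(*operands))
```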

24

uop Limits
• How many uops can a decoder generate?
– For complex x86 insts, many are needed (10’s, 100’s?)
– Makes decoder horribly complex
– Typically there’s a limit to keep complexity under control
  • One x86 instruction → 1-4 uops
  • Most instructions translate to 1.5-2.0 uops
• Ok, what happens if a complex instruction needs more than 4 uops?


25

UROM/MS for Complex x86 Insts
• UROM (ucode-ROM) stores the uop equivalents for nasty x86 instructions
– “Nasty” could be large/complex (> 4 uops like PUSHA or STRREP.MOV) or obsolete instructions (AAA)
• Microsequencer (MS) is the control logic that interfaces between the post-decode pipestages, the UROM, the decoders and the PC-generation


26

UROM/MS Example (3 uop-wide)

[Figure: cycle-by-cycle animation (Cycle 1 through Cycle n+1) with three rows: Fetch (x86 insts), Decode (uops) and UROM (uops). The fetched x86 instructions include ADD, STORE, SUB, REP.MOV, ADD [ ] and XOR. The simple instructions decode directly (ADD; STORE as STA and STD; SUB). For REP.MOV the UROM supplies LOAD, STORE, INC and uJCC uops over several cycles while ADD [ ] and XOR wait behind it.]

Complex instructions get uops from the ucode sequencer

27

Superscalar CISC Decode

1. Instruction Length Decode (ILD)
• Where are the instructions?
• Limited decode – just enough to parse prefixes, modes
2. Shift/Alignment
• Get the right bytes to the decoders
3. Decode
• Crack into uops

And then do this for N instructions per cycle!

28

ILD Recurrence/Loop
• PC_i = X
• PC_{i+1} = PC_i + sizeof( Mem[PC_i] )
• PC_{i+2} = PC_{i+1} + sizeof( Mem[PC_{i+1}] )
           = PC_i + sizeof( Mem[PC_i] ) + sizeof( Mem[PC_{i+1}] )
• Can’t find start of next instruction without decoding the first
• Critical loop not pipelineable
– ILD of 4 instructions per cycle implies that clock cycle time will be 4 x latency(ILD)
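The recurrence can be seen directly in a loop where each iteration needs the previous iteration's result. The sketch below uses a made-up toy encoding (`toy_length`: the low 4 bits of the first byte give the instruction length), not real x86 length rules.

```python
def toy_length(first_byte: int) -> int:
    """Hypothetical length rule: low 4 bits of the first byte, minimum 1."""
    return max(1, first_byte & 0x0F)


def find_starts(code: bytes, count: int, pc: int = 0) -> list:
    """Serial ILD: each start address depends on the length found at the previous one."""
    starts = []
    for _ in range(count):
        if pc >= len(code):
            break
        starts.append(pc)
        pc += toy_length(code[pc])   # PC_{i+1} = PC_i + sizeof(Mem[PC_i])
    return starts


# Three instructions of lengths 3, 5 and 1, followed by one of length 2.
code = bytes([3, 0, 0, 5, 0, 0, 0, 0, 1, 2, 0])
print(find_starts(code, 4))
```

Each pass through the loop stands for one ILD latency, and none of them can start before the one before it finishes.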


29

Decode Implementation

[Figure: three-cycle decode. Cycle 1: Instruction Bytes (ex. 16 bytes) pass through three chained ILD (limited decode) blocks producing Length 1, Length 2 and Length 3; adders (+) accumulate them into the number of bytes decoded. Cycle 2: a Left Shifter separates Inst 1, Inst 2, Inst 3 and the Remainder. Cycle 3: Decoder 1, Decoder 2 and Decoder 3.]

ILD dominates cycle time; not scalable

30

Hardware-Intensive Decode

[Figure: 16 ILD blocks, one per byte position, feeding three Decoders.]

Decode from every possible instruction starting point!

Giant MUXes to select instruction bytes

31

ILD in Hardware-Intensive Approach

[Figure: 16 decoders, one per byte position; figure labels: 6 bytes, 4 bytes, 3 bytes, +.]

Total bytes decode = 11

Previous: 3 ILD + 2 add
Now: 1 ILD + 2 (mux + add)
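A rough software analogue of the hardware-intensive approach: compute a length at every byte position up front (all in parallel in hardware), then walk the results with a cheap select-and-add chain. As before, `toy_length` is a made-up length rule and `parallel_ild` is a hypothetical name.

```python
def toy_length(first_byte: int) -> int:
    """Hypothetical length rule: low 4 bits of the first byte, minimum 1."""
    return max(1, first_byte & 0x0F)


def parallel_ild(window: bytes, count: int):
    """Speculative ILD: lengths at all positions first, then mux + add to chain them."""
    # Step 1: one ILD per byte position; no dependences between positions.
    lengths = [toy_length(b) for b in window]
    # Step 2: only a table lookup (mux) and an add remain on the serial path.
    starts, pos = [], 0
    for _ in range(count):
        if pos >= len(window):
            break
        starts.append(pos)
        pos += lengths[pos]
    return starts, pos   # pos is the total number of bytes consumed


window = bytes([3, 0, 0, 5, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0])
print(parallel_ild(window, 3))
```

Most of the per-position lengths are wasted work, since only a few positions are real instruction starts; that is the hardware and power cost traded for the shorter critical path.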

32

Predecoding
• ILD loop is hardware intensive, impacts latency, and can consume substantial power
• Observation: when instructions A, B and C are decoded into lengths 3, 5 and 1, the next time we encounter A, B and C, their lengths will still be the same!
– cache the ILD work
– do once, reuse many times
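The "do once, reuse many times" idea is essentially memoization. The sketch below keeps lengths in a dictionary keyed by address; this is only a software analogy (a real design would store predecode bits alongside the instruction bytes in the I$), and the names `PredecodeCache`, `walk` and `toy_length` are made up.

```python
def toy_length(first_byte: int) -> int:
    """Hypothetical length rule: low 4 bits of the first byte, minimum 1."""
    return max(1, first_byte & 0x0F)


class PredecodeCache:
    """Remember instruction lengths so the ILD work is done once per address."""

    def __init__(self, length_fn):
        self.length_fn = length_fn
        self.lengths = {}      # instruction address -> cached length
        self.ild_calls = 0     # how many times the real ILD had to run

    def length_at(self, code: bytes, pc: int) -> int:
        if pc not in self.lengths:
            self.ild_calls += 1
            self.lengths[pc] = self.length_fn(code[pc])
        return self.lengths[pc]


def walk(cache, code, count, pc=0):
    """Return the lengths of `count` consecutive instructions starting at pc."""
    out = []
    for _ in range(count):
        n = cache.length_at(code, pc)
        out.append(n)
        pc += n
    return out


code = bytes([3, 0, 0, 5, 0, 0, 0, 0, 1])   # A, B, C with lengths 3, 5, 1
cache = PredecodeCache(toy_length)
first = walk(cache, code, 3)
second = walk(cache, code, 3)               # second encounter hits the cache
print(first, second, cache.ild_calls)
```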


33

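Decoder Example: AMD K5

[Figure: 8 bytes (b0 b1 b2 … b7) arrive From Memory into the Predecode Logic, which attaches +5 predecode bits to each 8-bit byte, so 13 bytes are stored in the I$. Paths out of the I$ are labeled 16 × (8-bit inst + 5-bit predecode) and 8 × (8-bit inst + 5-bit predecode). Decode produces up to 4 ROPs.]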

34

Decoder Example: AMD K5
• Predecode information makes decode easier
– Instruction start/end location (ILD)
– Number of ROPs needed per inst
– Opcode and prefix locations
• Power/performance tradeoffs
– Larger I$ (increase data by 62.5%)
  • Longer I$ latency, more I$ power consumption
– Remove logic from decode
  • Shorter branch mispred penalty, simpler logic
  • Cache and reuse decode work → less decode power
– Longer effective IL1 miss latency


35

Limits on Decode
• Max branches (color allocation)
• Taken branches
• Incomplete instructions
– x86 insts are not aligned, may span two cache lines
– can’t decode until both halves have been fetched
• Instruction complexity
– decoding “complex” (2-4 uop) instructions requires a more complex decoder; expensive to replicate
– compromise: fewer complex decoders plus simpler decoders for instructions with single-uop mappings


36

Decoder Example: Intel P-Pro

[Figure: 16 Raw Instruction Bytes feed Decoder 0 (up to 4 uops), Decoder 1 (1 uop) and Decoder 2 (1 uop). Decoder 0 connects to the uROM. A Branch Address Calculator triggers a fetch resteer if needed.]

If the instruction in Decoder 1 or 2 requires > 1 uop, that decoder does not generate any output, and the instruction shifts to the decoder to its left on the next cycle

Only Decoder 0 can interface with the uROM and MS
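A simplified model of the 4-1-1 rule shows how instruction mix affects decode throughput. This sketch assumes Decoder 0 accepts any instruction in one decode slot (including ones that need the uROM, whose extra cycles are not modeled), and that a multi-uop instruction arriving at Decoder 1 or 2 ends the decode group for that cycle. The function name `decode_cycles` is made up.

```python
def decode_cycles(uop_counts, width: int = 3) -> int:
    """Count cycles to decode a sequence, given each instruction's uop count."""
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        i += 1                       # Decoder 0 takes the oldest instruction
        slot = 1
        # Decoders 1 and 2 only accept single-uop instructions
        while slot < width and i < n and uop_counts[i] == 1:
            i += 1
            slot += 1
    return cycles


print(decode_cycles([1, 1, 1]))      # all simple: one group
print(decode_cycles([4, 1, 1]))      # complex instruction lands on Decoder 0
print(decode_cycles([1, 2, 1]))      # 2-uop instruction must wait for Decoder 0
print(decode_cycles([2, 2, 2]))      # every instruction needs Decoder 0
```

The same three instructions can take one cycle or three depending on where the multi-uop instructions fall.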

37

Decoder Example: Intel P4

[Figure: Raw instruction bytes come from the L2 Cache into a 4-uop Decoder with a uROM (decode at most one inst per cycle). Decoded uops go through the trace const. buffer into the Trace Cache (fetch up to 3 uops per cycle).]

P4 has a strangled front-end: at best it can only deliver 3 uops per cycle. Contrast with the P-Pro, which can deliver up to 6 uops per cycle (if they’re 4/1/1)

More on this when we study the P4 in detail

38

Pipeline Control

[Figure: pipeline stages BPred, I$, I$, ILD, Rot, Dec, Dec, Ren, Disp, grouped into Fetch and Decode. At Dispatch: ROB, LSQ, RS full → stall front-end.]

Except not everyone stalls…

This logic is starting to get pretty intense

39

Non-Uniform Stall

Just because there’s a stall condition somewhere does not imply that everybody has to stall

[Figure: pipeline stages BP, I$, I$, ILD, Rot, Dec, Dec, Ren, Disp with "ROB Full!" at Disp. Four stages are marked full and one is marked not full, holding nops due to an I$ miss.]

40

Compressing/Serpentine Pipelines

[Figure: pipeline stages BP, I$, I$, ILD, Rot, Dec, Dec, Ren, Disp with "ROB Full!" at Disp and one stage marked "1 entry free".]

Better “flow”, but much more complex since need to track how many insts can advance per stage

41

Lots o’ Stalls
• I$ miss, ITLB miss
• Decoder limitations
– x86 4-1-1 limit
– branch limits (max/cycle, max taken/cycle)
• Renamer – out of physical registers


42

Smaller Control Domains
• Separate long pipeline into multiple smaller pipelines

[Figure: one long pipeline (BPred, I$, I$, Dec, Dec, Dec, Ren, Alloc, Sched) split into two shorter ones: BPred, I$, I$, Dec, Dec, Dec and Ren, Alloc, Sched.]

43

Smaller Control Domains (2)

• Non-decoupled pipe needed logic to simultaneously control ~10 stages
• Decoupled pipe needs multiple control logic circuits
– each only needs to interact with ~5 stages (~3 real stages, plus the queue ahead and behind)

[Figure: Non-decoupled: one Pipeline Control Logic block spans all stages. Decoupled: a Pipeline Control Logic block covers only I$, Dec, Dec, Dec, Ren; connections to other stages are crossed out (X): no direct control logic for stages outside of local pipeline.]

44

Smaller Control Domains (3)

• Queues can effectively add more pipeline stages

[Figure: without a queue, previous stage → latch (cycle boundary) → next stage. With a queue, previous stage → enqueue logic → inter-pipe queue → dequeue logic → next stage.]

• Avoid this by writing and reading in the same cycle (affects timing, complexity)

45

Queues provide Smoothing
• Approximation to serpentine pipes (compress only at certain locations – i.e., the queues)
• Different levels of decoupling possible depending on frequency target, power, complexity tolerance

[Figure: The “SimpleScalar” pipeline: BPred, I$, I$, Dec, Ren, Sched.]

Note: RS is effectively a queue (more later)
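The smoothing effect can be illustrated with a toy cycle-level model: a producer that supplies uops in bursts, a consumer that accepts them in bursts, and a queue of configurable capacity between them. Everything here (the function `delivered`, the burst patterns, the capacities) is a made-up example, not a model of any real pipeline.

```python
def delivered(supply, demand, capacity: int) -> int:
    """Total items passed downstream through a queue of the given capacity.

    supply[t]: items the upstream stage could produce in cycle t
    demand[t]: items the downstream stage could accept in cycle t
    """
    queue = 0
    total = 0
    for s, d in zip(supply, demand):
        taken = min(queue, d)               # downstream drains the queue first
        queue -= taken
        total += taken
        queue += min(s, capacity - queue)   # upstream stalls once the queue is full
    return total


# Upstream is busy for 3 cycles then idle (e.g. an I$ miss);
# downstream is stalled for 3 cycles then free.
supply = [3, 3, 3, 0, 0, 0] * 2
demand = [0, 0, 0, 3, 3, 3] * 2

print(delivered(supply, demand, capacity=3))   # small buffer: upstream work is lost to stalls
print(delivered(supply, demand, capacity=9))   # deeper queue absorbs the mismatch
```

With the deeper queue, the upstream stage keeps working during the downstream stall and the downstream stage keeps working during the upstream bubble.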

46

Different Clocking Domains
• Decoupling the pipe allows each segment to operate independently (local control)
• Also means each can run at different speeds (P4)

[Figure: P4 pipeline segments TC, TC, Dec, …, Alloc; Sched, …, WB; Commit, separated by queues (uopQ), (IAQ) and (ROB). Clock labels on the segments: 1x freq (3 uops/Mclk); ½x freq (6 uops/2 Mclk’s = 3 uops/Mclk); 2x freq (2 uops/Fclk = 4 uops/Mclk); 1x freq (3 uops/Mclk).]
