CSE 502: Computer Architecture
Superscalar Decode
Superscalar Decode for RISC ISAs
• Decode X insns. per cycle (e.g., 4-wide)
– Just duplicate the hardware
– Instructions aligned at 32-bit boundaries
[Figure: a scalar front end feeds one 32-bit inst to one decoder per cycle; a 4-wide superscalar fetch feeds four 32-bit insts to four decoders, producing four decoded insts per cycle]
uop Limits
• How many uops can a decoder generate?
– For complex x86 insts, many are needed (10's, 100's?)
– Makes the decoder horribly complex
– Typically there's a limit to keep complexity under control
• One x86 instruction becomes 1-4 uops
• On average, instructions translate to 1.5-2.0 uops
• What if a complex insn. needs more than 4 uops?
UROM/MS for Complex x86 Insts
• UROM (microcode ROM) stores the uop equivalents
– Used for nasty x86 instructions
• Complex, i.e. > 4 uops (PUSHA or REP.MOV)
• Obsolete (like AAA)
• Microsequencer (MS) handles the UROM interaction
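As a toy illustration of the split between table-driven decode and the microsequencer, consider the sketch below. All instruction names, encodings, and uop counts are hypothetical, not taken from any real implementation:

```python
# Sketch of uop cracking with a UROM fallback (all uop counts hypothetical):
# simple insts come from a small decode table (<= 4 uops); anything longer
# is supplied from the microcode ROM by the microsequencer.

DECODE_TABLE = {                     # inst -> uops, only if it fits in 4 uops
    "ADD":   ["add"],
    "STORE": ["sta", "std"],         # store cracks into address + data uops
}
UROM = {                             # complex insts: sequences of any length
    "PUSHA": ["sta", "std"] * 8,     # 8 register pushes -> 16 uops (toy)
}

def crack(inst):
    """Return the uop sequence for one instruction."""
    if inst in DECODE_TABLE:
        return DECODE_TABLE[inst]    # normal decoder path
    return UROM[inst]                # microsequencer streams these over cycles

print(crack("STORE"))        # ['sta', 'std']
print(len(crack("PUSHA")))   # 16
```

The point of the split is that the decode table stays small and fast, while arbitrarily long sequences live in the ROM.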
UROM/MS Example (3 uop-wide)
[Figure: cycle-by-cycle timeline of fetch (x86 insts), decode (uops), and UROM (uops). Cycle 1: ADD, STORE, SUB are fetched. Cycle 2: the decoders emit ADD, STA, STD (STORE cracks into STA + STD). Cycle 3: SUB is decoded; REP.MOV, ADD [ ], XOR are fetched, and REP.MOV invokes the microsequencer. Cycles 3 through n: the UROM streams the REP.MOV uop loop (LOAD, STORE, INC, mJCC, ...) while ADD [ ] and XOR wait. Cycle n+1: the loop's final mJCC and LOAD issue, after which ADD and XOR decode.]
Complex instructions get uops from the mcode sequencer
Superscalar CISC Decode
• Instruction Length Decode (ILD)
– Where are the instructions?
• Limited decode – just enough to parse prefixes, modes
• Shift/Alignment
– Get the right bytes to the decoders
• Decode
– Crack into uops
Do this for N instructions per cycle!
ILD Recurrence/Loop

PCi = X
PCi+1 = PCi + sizeof( Mem[PCi] )
PCi+2 = PCi+1 + sizeof( Mem[PCi+1] )
      = PCi + sizeof( Mem[PCi] ) + sizeof( Mem[PCi+1] )

• Can't find the start of the next insn. before decoding the current one
• Must do ILD serially
– ILD of 4 insns/cycle implies cycle time will be 4x
This critical component is not pipeline-able
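The serial dependence in the recurrence can be seen in a toy sketch. The length encoding here is invented purely for illustration (real x86 lengths come from parsing prefixes, opcode, ModR/M, etc.):

```python
# Serial instruction-length decode (ILD): each start address depends on
# the previous instruction's length, so the loop cannot be parallelized.

def insn_length(mem, pc):
    """Stand-in for limited decode; toy encoding: first byte = length."""
    return mem[pc]

def ild(mem, pc, n):
    """Return the start PCs of the next n instructions beginning at pc."""
    starts = []
    for _ in range(n):
        starts.append(pc)
        pc += insn_length(mem, pc)  # must decode insn i to find insn i+1
    return starts

# Byte stream where each instruction's first byte encodes its length
mem = {0: 3, 3: 6, 9: 4, 13: 2}
print(ild(mem, 0, 4))  # [0, 3, 9, 13]
```

The `pc +=` line is exactly the PCi+1 = PCi + sizeof( Mem[PCi] ) recurrence: four instructions per cycle means four of these additions in series.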
Bad x86 Decode Implementation
[Figure: 16 instruction bytes enter a serial chain. Cycle 1: ILD (limited decode) finds Length 1, which is added to locate Inst 2. Cycle 2: a second ILD finds Length 2 to locate Inst 3. Cycle 3: a third ILD finds Length 3; a left shifter then steers Inst 1, Inst 2, Inst 3, and the remainder bytes to Decoders 1-3]
ILD dominates cycle time; not scalable
Hardware-Intensive Decode
Decode from every possible instruction starting point!
Giant MUXes to select instruction bytes
[Figure: one ILD unit per byte offset runs in parallel; MUXes pick out the bytes belonging to each selected instruction and feed the decoders]
ILD in Hardware-Intensive Approach
[Figure: a decoder at every byte offset; in the example the selected insts are 6, 4, and 3 bytes long; total bytes decoded = 11]
Previous: 3 ILD + 2 add in series
Now: 1 ILD + 2 (mux + add)
Predecoding
• The ILD loop is hardware intensive
– Impacts latency
– Consumes substantial power
• If instructions A, B, and C are decoded
– ... lengths for A, B, and C will still be the same next time
– No need to repeat ILD
Possible to cache the ILD work
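A sketch of the caching idea, again with an invented length encoding: pay the serial ILD once per I$ fill, store the boundaries with the line, and make every later fetch a simple lookup:

```python
# Predecode sketch: run ILD once, on I$ fill, and store the results
# alongside the cached bytes so fetch never repeats the serial loop.

def fill_line(raw_bytes, length_of):
    """On an I$ miss, run ILD once and attach predecode info."""
    predecode = []
    i = 0
    while i < len(raw_bytes):
        n = length_of(raw_bytes, i)   # serial ILD, paid once per fill
        predecode.append((i, n))      # (start offset, length)
        i += n
    return {"bytes": raw_bytes, "pre": predecode}

def fetch(line, k):
    """On a hit, boundaries are already known: no ILD loop."""
    return line["pre"][:k]

toy_len = lambda b, i: b[i]           # toy encoding: first byte = length
line = fill_line([2, 0, 3, 0, 0, 1], toy_len)
print(fetch(line, 3))   # [(0, 2), (2, 3), (5, 1)]
```

This mirrors the K5-style tradeoff on the next slide: the line gets wider (the `pre` list costs storage), but fetch-path latency and decode power drop.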
Decoder Example: AMD K5 (1/2)
[Figure: predecode logic sits between memory and the I$. Each fetched byte (b0 ... b7, 8 bytes per chunk) is extended with 5 predecode bits (8 + 5 = 13 bits per byte); the I$ stores the widened bytes, and decode consumes 16 of the (8-bit inst + 5-bit predecode) entries per cycle, emitting up to 4 ROPs]
Compute ILD on fetch, store ILD in the I$
Decoder Example: AMD K5 (2/2)
• Predecode makes decode easier by providing:
– Instruction start/end location (ILD)
– Number of ROPs needed per inst
– Opcode and prefix locations
• Power/performance tradeoffs
– Larger I$ (array grows by 62.5%)
• Longer I$ latency, more I$ power consumption
– Removes logic from decode
• Shorter pipeline, simpler logic
• Caching and reusing decode work means less decode power
– Longer effective L1-I miss latency (ILD on fill)
Decoder Example: Intel P-Pro
• Only Decoder 0 interfaces with the uROM and MS
• If an insn. in Decoder 1 or Decoder 2 requires > 1 uop:
1) it does not generate output
2) it shifts to Decoder 0 on the next cycle
[Figure: 16 raw instruction bytes feed Decoder 0 (up to 4 uops, backed by the mROM) and Decoders 1 and 2 (1 uop each); a Branch Address Calculator re-steers fetch if needed]
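The re-steering rule can be sketched as follows; the uop counts are illustrative, not actual P-Pro values:

```python
# Sketch of P-Pro style 4-1-1 steering: only decoder 0 handles multi-uop
# insts; a multi-uop inst in decoder 1 or 2 emits nothing this cycle and
# is shifted to decoder 0 on the next cycle.

UOPS = {"ADD": 1, "STORE": 2, "XOR": 1}   # hypothetical uop counts

def decode_cycle(insts):
    """Return (uops emitted this cycle, insts deferred to next cycle)."""
    emitted = []
    for slot, inst in enumerate(insts[:3]):
        if slot > 0 and UOPS[inst] > 1:    # decoders 1 and 2 are 1-uop only
            return emitted, insts[slot:]   # re-steer rest to decoder 0
        emitted += [inst] * UOPS[inst]
    return emitted, insts[3:]

print(decode_cycle(["ADD", "STORE", "XOR"]))
# (['ADD'], ['STORE', 'XOR'])  -- STORE stalls, retries in decoder 0
print(decode_cycle(["STORE", "ADD", "XOR"]))
# (['STORE', 'STORE', 'ADD', 'XOR'], [])  -- multi-uop inst fits in slot 0
```

The two calls show why compilers scheduling for such machines tried to place multi-uop instructions first in each group.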
Fetch Rate is an ILP Upper Bound
• Instruction fetch limits performance
– To sustain an IPC of N, must sustain a fetch rate of N per cycle
• If you consume 1500 calories per day but burn 2000 calories per day, then you will eventually starve.
– Need to fetch N on average, not on every cycle
• An N-wide superscalar ideally fetches N insns. per cycle
• This doesn't happen in practice due to:
– Instruction cache organization
– Branches
– ... and the interaction between the two
Instruction Cache Organization
• To fetch N instructions per cycle...
– The L1-I line must be wide enough for N instructions
• The PC register selects the L1-I line
• A fetch group is the set of insns. starting at PC
– For an N-wide machine: [PC, PC+N-1]
[Figure: the PC indexes one tagged cache line of N instruction slots, which feed the decoder]
Fetch Misalignment (1/2)
• If PC = xxx01001 and N = 4:
– The ideal fetch group is xxx01001 through xxx01100 (inclusive)
[Figure: the PC selects slot 01 of a cache line, but the fetch group cannot cross the line boundary, so only the last 3 of the line's 4 slots are delivered]
Misalignment reduces fetch width
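For fixed-size instruction slots, the delivered width follows directly from the PC's position in the line. A sketch, where `pc` counts instruction slots rather than bytes:

```python
# Fetch misalignment: the fetch group cannot cross a cache-line boundary,
# so the insns delivered in one cycle are limited by the PC's slot.

def insts_fetched(pc, n):
    """Insts delivered in one cycle by an n-wide machine with
    n-instruction I$ lines (fixed-size instruction slots)."""
    slot = pc % n          # position within the cache line
    return n - slot        # truncated at the end of the line

print(insts_fetched(0b01001, 4))  # 3: slot 01, loses 1 of 4
print(insts_fetched(0b01100, 4))  # 4: aligned, full width
```

Averaged over random alignments this delivers (N+1)/2 + ... fewer than N insts per cycle, which is why the fragmentation fixes on the later slides matter.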
Fetch Misalignment (2/2)
• Now it takes two cycles to fetch the N instructions
– ½ fetch bandwidth!
[Figure: cycle 1 (PC = xxx01001) delivers the 3 insts up to the end of the line; cycle 2 (PC = xxx01100) delivers the remaining inst from the next line]
Might not be ½: the leftover insts can combine with the next fetch
Reducing Fetch Fragmentation (1/2)
• Make |Fetch Group| < |L1-I Line|
[Figure: each cache line holds two fetch groups' worth of insts; the PC selects an N-inst window from within the wider line]
Can deliver N insns. when PC is > N from the end of the line
Reducing Fetch Fragmentation (2/2)
• Needs a "rotator" to deliver insns. to the decoders in the correct order
[Figure: the insts selected from the line pass through a rotator, producing an aligned fetch group for the decoders]
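The rotator amounts to a rotate by the PC's offset within the fetched group, so the instruction at PC always lands in decoder 0. A sketch, with `pc` counting instruction slots:

```python
# Rotator sketch: fetch reads a full group of slots; the rotator rotates
# them so the inst at PC feeds decoder 0, the next inst decoder 1, etc.

def rotate_fetch_group(slots, pc):
    """Rotate the fetched slots so slots[pc % len(slots)] comes first."""
    start = pc % len(slots)
    return slots[start:] + slots[:start]

line = ["I0", "I1", "I2", "I3"]
print(rotate_fetch_group(line, 0b01001))  # ['I1', 'I2', 'I3', 'I0']
```

In hardware this is a barrel shifter over instruction slots; the wrapped-around entries are the ones a wider-than-group line (previous slide) can fill with valid insts instead of stale ones.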