CSE 502: Computer Architecture
Superscalar Decode
Superscalar Decode for RISC ISAs
• Decode X insns. per cycle (e.g., 4-wide)
– Just duplicate the hardware
– Instructions aligned at 32-bit boundaries
[Figure: a scalar front end feeds one 32-bit inst to one decoder per cycle; a 4-wide superscalar fetch feeds four 32-bit insts to four decoders, producing four decoded insts per cycle]
uop Limits
• How many uops can a decoder generate?
– For complex x86 insts, many are needed (10's, 100's?)
– Makes the decoder horribly complex
– Typically there's a limit to keep complexity under control
• One x86 instruction becomes 1-4 uops
• On average, instructions translate to 1.5-2.0 uops
• What if a complex insn. needs more than 4 uops?
UROM/MS for Complex x86 Insts
• UROM (microcode ROM) stores the uop equivalents
– Used for nasty x86 instructions
• Complex, i.e. > 4 uops (PUSHA or REP.MOV)
• Obsolete (like AAA)
• Microsequencer (MS) handles the UROM interaction
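As a toy illustration of the split between table-driven decode and the microsequencer, consider the sketch below. All instruction names, encodings, and uop counts are hypothetical, not taken from any real implementation:

```python
# Sketch of uop cracking with a UROM fallback (all uop counts hypothetical):
# simple insts come from a small decode table (<= 4 uops); anything longer
# is supplied from the microcode ROM by the microsequencer.

DECODE_TABLE = {                     # inst -> uops, only if it fits in 4 uops
    "ADD":   ["add"],
    "STORE": ["sta", "std"],         # store cracks into address + data uops
}
UROM = {                             # complex insts: sequences of any length
    "PUSHA": ["sta", "std"] * 8,     # 8 register pushes -> 16 uops (toy)
}

def crack(inst):
    """Return the uop sequence for one instruction."""
    if inst in DECODE_TABLE:
        return DECODE_TABLE[inst]    # normal decoder path
    return UROM[inst]                # microsequencer streams these over cycles

print(crack("STORE"))        # ['sta', 'std']
print(len(crack("PUSHA")))   # 16
```

The point of the split is that the decode table stays small and fast, while arbitrarily long sequences live in the ROM.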
UROM/MS Example (3 uop-wide)
[Figure: cycle-by-cycle timeline of fetch (x86 insts), decode (uops), and UROM (uops). Cycle 1: ADD, STORE, SUB are fetched. Cycle 2: the decoders emit ADD, STA, STD (STORE cracks into STA + STD). Cycle 3: SUB is decoded; REP.MOV, ADD [ ], XOR are fetched, and REP.MOV invokes the microsequencer. Cycles 3 through n: the UROM streams the REP.MOV uop loop (LOAD, STORE, INC, mJCC, ...) while ADD [ ] and XOR wait. Cycle n+1: the loop's final mJCC and LOAD issue, after which ADD and XOR decode.]
Complex instructions get uops from the mcode sequencer
Superscalar CISC Decode
• Instruction Length Decode (ILD)
– Where are the instructions?
• Limited decode – just enough to parse prefixes, modes
• Shift/Alignment
– Get the right bytes to the decoders
• Decode
– Crack into uops
Do this for N instructions per cycle!
ILD Recurrence/Loop

PCi = X
PCi+1 = PCi + sizeof( Mem[PCi] )
PCi+2 = PCi+1 + sizeof( Mem[PCi+1] )
      = PCi + sizeof( Mem[PCi] ) + sizeof( Mem[PCi+1] )

• Can't find the start of the next insn. before decoding the current one
• Must do ILD serially
– ILD of 4 insns/cycle implies cycle time will be 4x
This critical component is not pipeline-able
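The serial dependence in the recurrence can be seen in a toy sketch. The length encoding here is invented purely for illustration (real x86 lengths come from parsing prefixes, opcode, ModR/M, etc.):

```python
# Serial instruction-length decode (ILD): each start address depends on
# the previous instruction's length, so the loop cannot be parallelized.

def insn_length(mem, pc):
    """Stand-in for limited decode; toy encoding: first byte = length."""
    return mem[pc]

def ild(mem, pc, n):
    """Return the start PCs of the next n instructions beginning at pc."""
    starts = []
    for _ in range(n):
        starts.append(pc)
        pc += insn_length(mem, pc)  # must decode insn i to find insn i+1
    return starts

# Byte stream where each instruction's first byte encodes its length
mem = {0: 3, 3: 6, 9: 4, 13: 2}
print(ild(mem, 0, 4))  # [0, 3, 9, 13]
```

The `pc +=` line is exactly the PCi+1 = PCi + sizeof( Mem[PCi] ) recurrence: four instructions per cycle means four of these additions in series.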
Bad x86 Decode Implementation
[Figure: 16 instruction bytes enter a serial chain. Cycle 1: ILD (limited decode) finds Length 1, which is added to locate Inst 2. Cycle 2: a second ILD finds Length 2 to locate Inst 3. Cycle 3: a third ILD finds Length 3; a left shifter then steers Inst 1, Inst 2, Inst 3, and the remainder bytes to Decoders 1-3]
ILD dominates cycle time; not scalable
Hardware-Intensive Decode
Decode from every possible instruction starting point!
Giant MUXes to select instruction bytes
[Figure: one ILD unit per byte offset runs in parallel; MUXes pick out the bytes belonging to each selected instruction and feed the decoders]
ILD in Hardware-Intensive Approach
[Figure: a decoder at every byte offset; in the example the selected insts are 6, 4, and 3 bytes long; total bytes decoded = 11]
Previous: 3 ILD + 2 add in series
Now: 1 ILD + 2 (mux + add)
Predecoding
• The ILD loop is hardware intensive
– Impacts latency
– Consumes substantial power
• If instructions A, B, and C are decoded
– ... lengths for A, B, and C will still be the same next time
– No need to repeat ILD
Possible to cache the ILD work
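A sketch of the caching idea, again with an invented length encoding: pay the serial ILD once per I$ fill, store the boundaries with the line, and make every later fetch a simple lookup:

```python
# Predecode sketch: run ILD once, on I$ fill, and store the results
# alongside the cached bytes so fetch never repeats the serial loop.

def fill_line(raw_bytes, length_of):
    """On an I$ miss, run ILD once and attach predecode info."""
    predecode = []
    i = 0
    while i < len(raw_bytes):
        n = length_of(raw_bytes, i)   # serial ILD, paid once per fill
        predecode.append((i, n))      # (start offset, length)
        i += n
    return {"bytes": raw_bytes, "pre": predecode}

def fetch(line, k):
    """On a hit, boundaries are already known: no ILD loop."""
    return line["pre"][:k]

toy_len = lambda b, i: b[i]           # toy encoding: first byte = length
line = fill_line([2, 0, 3, 0, 0, 1], toy_len)
print(fetch(line, 3))   # [(0, 2), (2, 3), (5, 1)]
```

This mirrors the K5-style tradeoff on the next slide: the line gets wider (the `pre` list costs storage), but fetch-path latency and decode power drop.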
Decoder Example: AMD K5 (1/2)
[Figure: predecode logic sits between memory and the I$. Each fetched byte (b0 ... b7, 8 bytes per chunk) is extended with 5 predecode bits (8 + 5 = 13 bits per byte); the I$ stores the widened bytes, and decode consumes 16 of the (8-bit inst + 5-bit predecode) entries per cycle, emitting up to 4 ROPs]
Compute ILD on fetch, store ILD in the I$
Decoder Example: AMD K5 (2/2)
• Predecode makes decode easier by providing:
– Instruction start/end location (ILD)
– Number of ROPs needed per inst
– Opcode and prefix locations
• Power/performance tradeoffs
– Larger I$ (array grows by 62.5%)
• Longer I$ latency, more I$ power consumption
– Removes logic from decode
• Shorter pipeline, simpler logic
• Caching and reusing decode work means less decode power
– Longer effective L1-I miss latency (ILD on fill)
Decoder Example: Intel P-Pro
• Only Decoder 0 interfaces with the uROM and MS
• If an insn. in Decoder 1 or Decoder 2 requires > 1 uop:
1) it does not generate output
2) it shifts to Decoder 0 on the next cycle
[Figure: 16 raw instruction bytes feed Decoder 0 (up to 4 uops, backed by the mROM) and Decoders 1 and 2 (1 uop each); a Branch Address Calculator re-steers fetch if needed]
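The re-steering rule can be sketched as follows; the uop counts are illustrative, not actual P-Pro values:

```python
# Sketch of P-Pro style 4-1-1 steering: only decoder 0 handles multi-uop
# insts; a multi-uop inst in decoder 1 or 2 emits nothing this cycle and
# is shifted to decoder 0 on the next cycle.

UOPS = {"ADD": 1, "STORE": 2, "XOR": 1}   # hypothetical uop counts

def decode_cycle(insts):
    """Return (uops emitted this cycle, insts deferred to next cycle)."""
    emitted = []
    for slot, inst in enumerate(insts[:3]):
        if slot > 0 and UOPS[inst] > 1:    # decoders 1 and 2 are 1-uop only
            return emitted, insts[slot:]   # re-steer rest to decoder 0
        emitted += [inst] * UOPS[inst]
    return emitted, insts[3:]

print(decode_cycle(["ADD", "STORE", "XOR"]))
# (['ADD'], ['STORE', 'XOR'])  -- STORE stalls, retries in decoder 0
print(decode_cycle(["STORE", "ADD", "XOR"]))
# (['STORE', 'STORE', 'ADD', 'XOR'], [])  -- multi-uop inst fits in slot 0
```

The two calls show why compilers scheduling for such machines tried to place multi-uop instructions first in each group.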
Fetch Rate is an ILP Upper Bound
• Instruction fetch limits performance
– To sustain an IPC of N, must sustain a fetch rate of N per cycle
• If you consume 1500 calories per day but burn 2000 calories per day, then you will eventually starve.
– Need to fetch N on average, not on every cycle
• An N-wide superscalar ideally fetches N insns. per cycle
• This doesn't happen in practice due to:
– Instruction cache organization
– Branches
– ... and the interaction between the two
Instruction Cache Organization
• To fetch N instructions per cycle...
– The L1-I line must be wide enough for N instructions
• The PC register selects the L1-I line
• A fetch group is the set of insns. starting at PC
– For an N-wide machine: [PC, PC+N-1]
[Figure: the PC indexes one tagged cache line of N instruction slots, which feed the decoder]
Fetch Misalignment (1/2)
• If PC = xxx01001 and N = 4:
– The ideal fetch group is xxx01001 through xxx01100 (inclusive)
[Figure: the PC selects slot 01 of a cache line, but the fetch group cannot cross the line boundary, so only the last 3 of the line's 4 slots are delivered]
Misalignment reduces fetch width
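For fixed-size instruction slots, the delivered width follows directly from the PC's position in the line. A sketch, where `pc` counts instruction slots rather than bytes:

```python
# Fetch misalignment: the fetch group cannot cross a cache-line boundary,
# so the insns delivered in one cycle are limited by the PC's slot.

def insts_fetched(pc, n):
    """Insts delivered in one cycle by an n-wide machine with
    n-instruction I$ lines (fixed-size instruction slots)."""
    slot = pc % n          # position within the cache line
    return n - slot        # truncated at the end of the line

print(insts_fetched(0b01001, 4))  # 3: slot 01, loses 1 of 4
print(insts_fetched(0b01100, 4))  # 4: aligned, full width
```

Averaged over random alignments this delivers (N+1)/2 + ... fewer than N insts per cycle, which is why the fragmentation fixes on the later slides matter.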
Fetch Misalignment (2/2)
• Now it takes two cycles to fetch the N instructions
– ½ fetch bandwidth!
[Figure: cycle 1 (PC = xxx01001) delivers the 3 insts up to the end of the line; cycle 2 (PC = xxx01100) delivers the remaining inst from the next line]
Might not be ½: the leftover insts can combine with the next fetch
Reducing Fetch Fragmentation (1/2)
• Make |Fetch Group| < |L1-I Line|
[Figure: each cache line holds two fetch groups' worth of insts; the PC selects an N-inst window from within the wider line]
Can deliver N insns. when PC is > N from the end of the line
Reducing Fetch Fragmentation (2/2)
• Needs a "rotator" to deliver insns. to the decoders in the correct order
[Figure: the insts selected from the line pass through a rotator, producing an aligned fetch group for the decoders]
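The rotator amounts to a rotate by the PC's offset within the fetched group, so the instruction at PC always lands in decoder 0. A sketch, with `pc` counting instruction slots:

```python
# Rotator sketch: fetch reads a full group of slots; the rotator rotates
# them so the inst at PC feeds decoder 0, the next inst decoder 1, etc.

def rotate_fetch_group(slots, pc):
    """Rotate the fetched slots so slots[pc % len(slots)] comes first."""
    start = pc % len(slots)
    return slots[start:] + slots[:start]

line = ["I0", "I1", "I2", "I3"]
print(rotate_fetch_group(line, 0b01001))  # ['I1', 'I2', 'I3', 'I0']
```

In hardware this is a barrel shifter over instruction slots; the wrapped-around entries are the ones a wider-than-group line (previous slide) can fill with valid insts instead of stale ones.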