Advanced Microarchitecture Lecture 6: Superscalar Decode and Other Pipelining
RISC ISA Format
• This should be review…
  – Fixed-length
    • MIPS: all insts are 32 bits (4 bytes)
  – Few formats
    • MIPS has 3: R-, I-, J-formats
    • Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP
  – Regularity across formats (when possible/practical)
    • MIPS, Alpha: opcode in same bit-position for all formats
    • MIPS: rs & rt fields in same bit-position for R- and I-formats
    • Alpha: ra/fa field in same bit-position for all 5 formats
RISC Decode (MIPS)
Instruction word: opcode (6 bits) | other (20 bits) | func (6 bits, R-format only)

Opcode map (rows: opcode[5,3]; columns: opcode[2,0]):
        000    001    010    011    100    101    110    111
  000   func   rt     j      jal    beq    bne    blez   bgtz
  001   addi   addiu  slti   sltiu  andi   ori    xori   lui
  010   rs     rs     rs     rs
  011
  100   lb     lh     lwl    lw     lbu    lhu    lwr
  101   sb     sh     swl    sw                   swr
  110   lwc0   lwc1   lwc2   lwc3
  111   swc0   swc1   swc2   swc3
(func, rt, rs: decode continues on that field)

• 001xxx = Immediate
• 1xxxxx = Memory (opcode[5,3] = 1x0: LD, 1x1: ST)
• 000xxx = Br/Jump (except for 000000)
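The three bit patterns above are enough to sort most opcodes into classes with a couple of shifts and masks. A minimal sketch, assuming the opcode sits in the top six bits of the 32-bit word; the function names and class labels are illustrative, not part of the lecture:

```python
def opcode_of(word: int) -> int:
    """Extract the 6-bit opcode from bits [31:26] of a 32-bit MIPS word."""
    return (word >> 26) & 0x3F


def classify_mips_opcode(opcode: int) -> str:
    """Classify a 6-bit opcode using the patterns listed on the slide."""
    if not 0 <= opcode < 64:
        raise ValueError("opcode must fit in 6 bits")
    if opcode == 0b000000:
        return "R-format (decode func field)"
    hi = opcode >> 3                    # opcode[5,3]
    if hi == 0b000:
        return "branch/jump"            # 000xxx, except 000000
    if hi == 0b001:
        return "immediate"              # 001xxx
    if opcode & 0b100000:               # 1xxxxx = memory
        # opcode[5,3] = 1x0 is a load, 1x1 is a store
        return "store" if opcode & 0b001000 else "load"
    return "other"                      # e.g. 010xxx coprocessor ops


# 0x8FA80004 has opcode 100011 (lw)
print(classify_mips_opcode(opcode_of(0x8FA80004)))
```

Because the class depends on only a few opcode bits, a hardware decoder can start steering the instruction before the remaining fields are examined.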
Superscalar Decode for RISC ISAs
• To sustain X instructions per cycle, must decode X instructions per cycle
  – Just duplicate the hardware
[Figure: scalar case: 1-fetch delivers one 32-bit inst to one Decoder, producing one decoded inst. Superscalar case: 4-wide superscalar fetch delivers four 32-bit insts to four parallel Decoders, producing four decoded insts.]
VLIW/EPIC ISAs
• Compiler finds the parallelism, packs multiple instructions into a “very long” instruction
[Figure: six RISC instructions (Add, Load, Branch, Sub, Store, Xor) packed into two VLIW (EPIC-like) instructions, each led by “template” bits: IMB | Add | Load | Branch and IMI | Sub | Store | Xor.]
VLIW Decoder
• Similar to superscalar RISC decoder
[Figure: a Template Decoder reads the tmplt bits; three Decoders take inst0, inst1 and inst2 and produce three decoded insts.]
CISC ISA
• RISC focus on fast access to information
  – easy decode, I$, large RF’s, D$
• CISCs are older
  – designed in era with fewer transistors, chips
  – each memory access very expensive
    • pack as much work into as few bytes as possible
• more “expressive” instructions
  – compare to simple RISC insts
  – better potential code generation
  – more complex code generation in practice
Example: VAX
• Superset of ISAs, incl. IBM360, DEC PDP-11
• VAX = “Virtual Address Extension”
• 16 32-bit registers
• 32-bit memory addressing
• Encoding:
  – 1-2 byte opcode, followed by 0-6 operand specifiers, each of which may be up to 5 bytes
  – Opcode implies datatype, size, # operands
  – Orthogonality: any opcode with any addressing mode
VAX operand addressing
  Mode               Example              Meaning
  Register           Add R4, R3           R4 = R4 + R3
  Immediate          Add R4, #3           R4 = R4 + 3
  Displacement       Add R4, 100(R1)      R4 = R4 + Mem[100+R1]
  Register Indirect  Add R4, (R1)         R4 = R4 + Mem[R1]
  Indexed/Base       Add R3, (R1+R2)      R3 = R3 + Mem[R1+R2]
  Direct/Absolute    Add R1, (1234)       R1 = R1 + Mem[1234]
  Memory Indirect    Add R1, @(R3)        R1 = R1 + Mem[Mem[R3]]
  Auto-Increment     Add R1, (R2)+        R1 = R1 + Mem[R2]; R2++
  Auto-Decrement     Add R1, -(R2)        R2--; R1 = R1 + Mem[R2]
Any mode could be applied to any instruction!
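The "Meaning" column of the addressing-mode table translates almost line for line into code. The sketch below is a toy model, not VAX-accurate: registers and memory are plain dictionaries, the mode names and the `fetch_operand` signature are made up for illustration, and auto-increment steps by 1 as the table writes it (R2++).

```python
def fetch_operand(mode, regs, mem, reg=None, reg2=None, const=0):
    """Return the source operand value for one addressing mode (toy model)."""
    if mode == "register":
        return regs[reg]
    if mode == "immediate":
        return const
    if mode == "displacement":
        return mem[const + regs[reg]]
    if mode == "register_indirect":
        return mem[regs[reg]]
    if mode == "indexed":
        return mem[regs[reg] + regs[reg2]]
    if mode == "absolute":
        return mem[const]
    if mode == "memory_indirect":
        return mem[mem[regs[reg]]]          # two dependent memory reads
    if mode == "auto_increment":
        value = mem[regs[reg]]
        regs[reg] += 1                      # side effect after the access
        return value
    if mode == "auto_decrement":
        regs[reg] -= 1                      # side effect before the access
        return mem[regs[reg]]
    raise ValueError(f"unknown mode: {mode}")


regs = {"R1": 100, "R2": 200, "R3": 300}
mem = {100: 7, 150: 11, 200: 9, 300: 100, 1234: 5}
# Add R1, @(R3): R1 = R1 + Mem[Mem[R3]]
regs["R1"] += fetch_operand("memory_indirect", regs, mem, reg="R3")
print(regs["R1"])
```

Even in this toy form, one operand can cost two memory reads plus a register update, which is part of what makes a fully orthogonal CISC instruction hard to decode and execute quickly.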
x86
• CISC, stemming from the original 4004
• Example: “Move” instructions
  1. General Purpose data movement
    – RR, MR, RM, IR, IM
  2. Exchanges
    – EAX ↔ ECX, byte order within a register
  3. Stack Manipulation
    – push pop R ↔ Stack, PUSHA/POPA
  4. Type Conversion
  5. Conditional Moves
Many ways to do the same/similar operation
x86 Encoding
• Basic x86 Instruction:
    Field          Size
    Prefixes       0-4 bytes
    Opcode         1-2 bytes
    Mod R/M        0-1 bytes
    SIB            0-1 bytes
    Displacement   0/1/2/4 bytes
    Immediate      0/1/2/4 bytes
  Longest inst: 15 bytes; shortest inst: 1 byte
• Opcode specifies operation, and if the Mod R/M byte is used
  – Most instructions use the Mod R/M byte
  – Mod R/M specifies if optional SIB byte is used
  – Mod R/M and SIB may specify additional constants
Mod R/M Byte
  Bits:    M M    | r r r    | m m m
  Field:   Mode   | Register | R/M      (Register and R/M each select 1 of 8 registers)
• Mode = 00: No displacement, use Mem[ reg[mmm] ]
• Mode = 01: 8-bit displacement, Mem[ reg[mmm] + SExt(disp) ]
• Mode = 10: 32-bit displacement (similar to previous)
• Mode = 11: Register-to-Register, use reg[mmm]
Examples:
  Add EBX, ECX      Mod R/M = 11 011 001
  Add EBX, [ECX]    Mod R/M = 00 011 001
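Splitting the Mod R/M byte into its three fields is a matter of shifts and masks. The sketch below covers only the four basic modes listed above and deliberately ignores the special R/M encodings; the helper names are illustrative, and the register order is the standard 32-bit x86 numbering.

```python
REG32 = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


def split_modrm(byte):
    """Return (mode, reg, rm) from the MM rrr mmm layout."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def describe_modrm(byte):
    """Name the register operand and the R/M operand (basic modes only)."""
    mode, reg, rm = split_modrm(byte)
    if mode == 0b11:
        operand = REG32[rm]                     # register-to-register
    elif mode == 0b00:
        operand = f"[{REG32[rm]}]"              # no displacement
    elif mode == 0b01:
        operand = f"[{REG32[rm]} + disp8]"      # sign-extended 8-bit disp
    else:
        operand = f"[{REG32[rm]} + disp32]"     # 32-bit disp
    return REG32[reg], operand


print(describe_modrm(0b11011001))   # the Add EBX, ECX example
print(describe_modrm(0b00011001))   # the Add EBX, [ECX] example
```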
Exceptions
• Mod=00, R/M = 5: get operand from 32-bit immediate (used as the memory address)
  – Mod R/M = 00 010 101 followed by 0cff1234: Add EDX = EDX + Mem[0cff1234]
• Mod=00, 01 or 10, R/M = 4: use the “SIB” byte
  – SIB = Scale/Index/Base
  – Example: Mod R/M = 00 010 100, followed by SIB = ss iii bbb
  – bbb ≠ 5: use reg[bbb]
  – bbb = 5: use 32-bit imm (Mod = 00 only)
  – iii ≠ 4: use si
  – iii = 4: use 0
  – si = reg[iii] << ss
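The SIB rules can be collected into one address computation. This is a sketch under simplifying assumptions: `regs` is a list of eight register values indexed by 3-bit register number, the caller has already fetched and sign-extended any displacement or 32-bit constant and passes it as `disp`, and the function name is hypothetical.

```python
def sib_address(mod, sib, regs, disp=0):
    """Effective address when Mod R/M has R/M = 4 (SIB byte present)."""
    ss = (sib >> 6) & 0b11
    iii = (sib >> 3) & 0b111
    bbb = sib & 0b111
    # iii = 4 means "no index"; otherwise scale the index register by 2**ss
    si = 0 if iii == 4 else regs[iii] << ss
    if bbb == 5 and mod == 0b00:
        base = 0            # no base register: the 32-bit constant is in disp
    else:
        base = regs[bbb]
    return (base + si + disp) & 0xFFFFFFFF


# Register values are made up: EAX = 2, EBX = 0x1000
regs = [2, 0, 0, 0x1000, 0, 0, 0, 0]
# SIB = 11 000 011: scale by 8, index EAX, base EBX; Mod = 10 with disp = 0x10
print(hex(sib_address(0b10, 0b11000011, regs, disp=0x10)))
```

Note how many fields (Mod, iii, bbb) must be inspected before the decoder even knows how many more bytes belong to the instruction.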
Opcode Confusion
• There are different opcodes for A←B and B←A
    10001011 11000011    MOV EAX, EBX
    10001001 11000011    MOV EBX, EAX
    10001001 11011000    MOV EAX, EBX
• If Opcode = 0F, then use next byte as opcode
• If Opcode = D8-DF, then FP instruction
[Figure: opcode byte 11011000 followed by a Mod R/M byte with Mode = 11, labeled “FP opcode”.]
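The three MOV encodings above can be checked mechanically: opcode 10001011 treats the Register field as the destination, while 10001001 treats the R/M field as the destination. A small sketch that handles only these two opcodes in register-to-register form (the function name is illustrative):

```python
REG32 = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


def decode_mov_rr(opcode, modrm):
    """Decode a register-to-register MOV for opcodes 0x8B and 0x89 only."""
    mode, reg, rm = modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111
    if mode != 0b11:
        raise ValueError("register-to-register form requires Mode = 11")
    if opcode == 0b10001011:        # destination is the Register field
        dst, src = reg, rm
    elif opcode == 0b10001001:      # destination is the R/M field
        dst, src = rm, reg
    else:
        raise ValueError("only the two MOV opcodes from the slide are handled")
    return f"MOV {REG32[dst]}, {REG32[src]}"


for opcode, modrm in [(0b10001011, 0b11000011),
                      (0b10001001, 0b11000011),
                      (0b10001001, 0b11011000)]:
    print(f"{opcode:08b} {modrm:08b}  {decode_mov_rr(opcode, modrm)}")
```

The first and third byte pairs decode to the same operation, which is the redundancy the slide is pointing at.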
x86 Decode Example
  Bytes       Field      Meaning
  11000111    opcode     MOV r/m ← imm (use Mod R/M, 32-bit Imm to follow)
  10000100    Mod R/M    Mod = 2 (use 32-bit Disp), R/M = 4 (use SIB), reg ignored
  11000011    SIB        ss = 3: scale by 8; use EAX, EBX
  (4 bytes)   Disp
  (4 bytes)   Imm
*( (EAX<<3) + EBX + Disp ) = Imm
Total: 11 byte instruction
Note: Add 4 prefixes, and you reach the max size
In RISC (MIPS)
1. lui R1 = Disp[31:16]
2. ori R1 = R1, Disp[15:0]
3. add R1 = R1 + R2
4. shli R3 = R3 << 3
5. add R3 = R3 + R1
6. lui R1 = Imm[31:16]
7. ori R1 = R1, Imm[15:0]
8. st [R3] R1
8 instructions, 32 bits each: 32 bytes total
2.9x Bigger!
x86-64 / EM64T
• 8 → 16 general purpose registers
  – only 3-bit register fields?
• Registers extended from 32 → 64 bits each
• Default: instructions still 32-bit
  – New “REX” prefix byte to specify additional information
  REX prefix (placed before the opcode): 0100 m R I B
    – m = 1: 64-bit mode; m = 0: 32-bit mode
    – R extends the Mod R/M register field (md rrr mrm): R + rrr
    – I extends the SIB index field (ss iii bbb): I + iii
    – B extends the SIB base field: B + bbb
Register specifiers are now 4 bits each: can choose 1 of 16 registers
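The REX bits act as a fourth, most-significant bit for the 3-bit register fields. The sketch below uses the slide's bit names (m, R, I, B); Intel's manuals call the same bits W, R, X and B. The function names are illustrative, and only the field widening is modeled.

```python
def decode_rex(rex):
    """Split a REX prefix byte 0100 m R I B into its four flag bits."""
    if rex >> 4 != 0b0100:
        raise ValueError("not a REX prefix")
    return {"m": (rex >> 3) & 1, "R": (rex >> 2) & 1,
            "I": (rex >> 1) & 1, "B": rex & 1}


def extended_regs(rex, modrm, sib=None):
    """Widen the 3-bit register fields to 4 bits using the REX bits."""
    f = decode_rex(rex)
    out = {"reg": (f["R"] << 3) | ((modrm >> 3) & 0b111)}
    if sib is None:
        # Without a SIB byte, B extends the Mod R/M r/m field instead
        out["rm"] = (f["B"] << 3) | (modrm & 0b111)
    else:
        out["index"] = (f["I"] << 3) | ((sib >> 3) & 0b111)
        out["base"] = (f["B"] << 3) | (sib & 0b111)
    return out


# REX = 0100 1100 sets m and R: the reg field 000 now selects register 8
print(extended_regs(0b01001100, 0b11000011))
```

Keeping the extra bits in an optional prefix is what lets the existing 3-bit encodings stay valid while still reaching 16 registers.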
64-bit Extensions to IA32
[Figure: cartoon with labels “IA32”, “IA32 + 64-bit exts” and “CPU architect”. (Taken from Bob Colwell’s Eckert-Mauchly Award Talk, ISCA 2005)]
Ugly? Scary? … but it works
x86 Decode Hardware
[Figure: decode datapath for the instruction bytes. Blocks shown: Prefix Decoder (producing Num Prefixes), opcode decoder, 2nd opcode decoder, Mod R/M decoder and SIB decoder, connected through three Left Shift units and two adders (+).]
Decoded x86 Format
• RISC: easy to expand union of needed info
  – generalized opcode (not too hard)
  – reg1, reg2, reg3, immediate (possibly extended)
  – some fields ignored
• CISC: union of all possible info is huge
  – generalized opcode (too many options)
  – up to 3 regs, 2 immediates
  – segment information
  – “rep” specifiers
    • would lead to 100’s of bits
    • common case only needs a fraction → a lot of waste
x86 RISC-like uops
• Each x86 instruction decoded into a variable number of “uops” (micro-ops - Intel) or ROPs (RISC ops - AMD)
  – Each uop is RISC-like
  – Uops have limitations to keep union of info practical
  x86 instruction    uops                                                Count
  ADD EAX, EBX       ADD EAX, EBX                                        1 uop
  ADD EAX, [EBX]     Load tmp = [EBX]; ADD EAX, tmp                      2 uops
  ADD [EAX], EBX     Load tmp = [EAX]; ADD tmp, EBX; STA EAX; STD tmp    4 uops
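The three ADD cases above follow one rule: a memory source adds a load, and a memory destination adds a load plus a store-address (STA) and store-data (STD) pair. A toy cracker for ADD only, assuming operands are strings and a bracketed operand such as "[EAX]" means memory; the function name and string format are illustrative:

```python
def crack_add(dst, src):
    """Crack 'ADD dst, src' into a list of RISC-like uops (toy model)."""
    def is_mem(op):
        return op.startswith("[") and op.endswith("]")

    if is_mem(dst) and is_mem(src):
        raise ValueError("x86 ADD cannot take two memory operands")
    if is_mem(src):
        return [f"LOAD tmp = {src}", f"ADD {dst}, tmp"]
    if is_mem(dst):
        addr = dst[1:-1]                 # register holding the address
        return [f"LOAD tmp = {dst}", f"ADD tmp, {src}",
                f"STA {addr}", "STD tmp"]
    return [f"ADD {dst}, {src}"]


for operands in [("EAX", "EBX"), ("EAX", "[EBX]"), ("[EAX]", "EBX")]:
    uops = crack_add(*operands)
    print(operands, len(uops), uops)
```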
uop Limits
• How many uops can a decoder generate?
  – For complex x86 insts, many are needed (10’s, 100’s?)
  – Makes decoder horribly complex
  – Typically there’s a limit to keep complexity under control
    • One x86 instruction → 1-4 uops
    • Most instructions translate to 1.5-2.0 uops
• Ok, what happens if a complex instruction needs more than 4 uops?
UROM/MS for Complex x86 Insts
• UROM (ucode-ROM) stores the uop equivalents for nasty x86 instructions
  – “Nasty” could be large/complex (> 4 uops, like PUSHA or the string instruction REP.MOV) or obsolete instructions (AAA)
• Microsequencer (MS) is the control logic that interfaces between the post-decode pipestages, the UROM, the decoders and the PC-generation
UROM/MS Example (3 uop-wide)
[Figure: cycle-by-cycle timeline (Cycle 1, Cycle 2, … Cycle n, Cycle n+1) with three rows: Fetch (x86 insts), Decode (uops) and UROM (uops). The fetched x86 stream is ADD, STORE, SUB, followed by REP.MOV, ADD [ ], XOR. ADD, STORE and SUB decode directly into the uops ADD, STA, STD, SUB. REP.MOV is expanded by the UROM into a repeating LOAD, STORE, INC, uJCC sequence over several cycles, while ADD [ ] and XOR wait; once it finishes, ADD [ ] decodes into LOAD, ADD.]
Complex instructions, get uops from ucode sequencer
Superscalar CISC Decode
1. Instruction Length Decode (ILD)
  • Where are the instructions?
  • Limited decode – just enough to parse prefixes, modes
2. Shift/Alignment
  • Get the right bytes to the decoders
3. Decode
  • Crack into uops
And then do this for N instructions per cycle!
ILD Recurrence/Loop
• PC_i = X
• PC_{i+1} = PC_i + sizeof( Mem[PC_i] )
• PC_{i+2} = PC_{i+1} + sizeof( Mem[PC_{i+1}] )
           = PC_i + sizeof( Mem[PC_i] ) + sizeof( Mem[PC_{i+1}] )
• Can’t find start of next instruction without decoding the first
• Critical loop not pipelineable
  – ILD of 4 instructions per cycle implies that clock cycle time will be 4 x latency(ILD)
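The recurrence can be written as a loop in which each iteration needs the result of the previous one. The sketch below uses a made-up toy encoding in which the first byte of every instruction holds its length; this is not x86, it only stands in for sizeof( Mem[PC] ).

```python
def toy_length(code, pc):
    """Hypothetical length decoder: the first byte of an instruction is its size."""
    return code[pc]


def find_starts_serial(code, length_of, count):
    """Follow PC_{i+1} = PC_i + sizeof(Mem[PC_i]) for up to `count` instructions."""
    starts, pc = [], 0
    for _ in range(count):
        if pc >= len(code):
            break
        starts.append(pc)
        pc += length_of(code, pc)   # cannot begin until the previous length is known
    return starts


# Four instructions of lengths 3, 5, 1 and 2 bytes
CODE = bytes([3, 0, 0, 5, 0, 0, 0, 0, 1, 2, 0])
print(find_starts_serial(CODE, toy_length, 4))
```

Each call to `length_of` sits on the critical path of the next one, which is why finding N starts per cycle costs roughly N length-decode latencies when done this way.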
Decode Implementation
[Figure: Instruction Bytes (ex. 16 bytes) pass through three chained ILD (limited decode) steps in Cycle 1, Cycle 2 and Cycle 3, producing Length 1, Length 2 and Length 3; adders (+) accumulate the bytes decoded. A Left Shifter splits the bytes into Inst 1, Inst 2, Inst 3 and Remainder, feeding Decoder 1, Decoder 2 and Decoder 3.]
ILD dominates cycle time; not scalable
Hardware-Intensive Decode
[Figure: 16 ILD units, one per byte position, feeding three Decoders.]
• Decode from every possible instruction starting point!
• Giant MUXes to select instruction bytes
ILD in Hardware-Intensive Approach
[Figure: 16 per-byte decoders operating in parallel; labels “6 bytes”, “4 bytes”, “3 bytes”, an adder (+), and “Total bytes decode = 11”.]
• Previous: 3 ILD + 2 add
• Now: 1 ILD + 2 (mux + add)
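The hardware-intensive approach can be mimicked in two steps: compute a length at every byte offset independently, then chain through the precomputed lengths with only a select and an add per instruction. The sketch assumes a made-up toy encoding (first byte of each instruction holds its length); it is not x86, and the function names are illustrative.

```python
def toy_length(code, pc):
    """Hypothetical length decoder: the first byte of an instruction is its size."""
    return code[pc]


def find_starts_parallel(code, length_of, count):
    # Step 1: one ILD per byte offset. No entry depends on any other, so in
    # hardware all of them evaluate at once. Entries at offsets that are not
    # real instruction starts are meaningless and are never selected.
    lengths = [length_of(code, pc) for pc in range(len(code))]
    # Step 2: walk the chain. Each hop is a table lookup (the mux) plus an add.
    starts, pc = [], 0
    for _ in range(count):
        if pc >= len(code):
            break
        starts.append(pc)
        pc += lengths[pc]
    return starts


CODE = bytes([3, 0, 0, 5, 0, 0, 0, 0, 1, 2, 0])
print(find_starts_parallel(CODE, toy_length, 4))
```

The serial dependence is still there in step 2, but each hop is now a mux and an add rather than a full length decode, at the price of one ILD per byte.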
Predecoding
• ILD loop is hardware intensive, impacts latency, and can consume substantial power
• Observation: when instructions A, B and C are decoded into lengths 3, 5 and 1, the next time we encounter A, B and C, their lengths will still be the same!
  – cache the ILD work
  – do once, reuse many times
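Caching the ILD work amounts to memoizing the length by instruction address. The sketch below is a software analogy only: it uses a dictionary and a made-up toy encoding (first byte of each instruction holds its length), and it ignores invalidation, which a real design would need whenever the cached bytes change.

```python
def toy_length(code, pc):
    """Hypothetical length decoder: the first byte of an instruction is its size."""
    return code[pc]


class PredecodeCache:
    """Remember instruction lengths so the length decode runs once per address."""

    def __init__(self, length_of):
        self.length_of = length_of
        self.lengths = {}       # pc -> cached instruction length
        self.ild_calls = 0      # how often the expensive decode actually ran

    def length(self, code, pc):
        if pc not in self.lengths:
            self.ild_calls += 1
            self.lengths[pc] = self.length_of(code, pc)
        return self.lengths[pc]


def walk(cache, code):
    """Return instruction lengths in program order, consulting the cache."""
    lengths, pc = [], 0
    while pc < len(code):
        n = cache.length(code, pc)
        lengths.append(n)
        pc += n
    return lengths


LOOP_BODY = bytes([3, 0, 0, 5, 0, 0, 0, 0, 1])   # A, B, C with lengths 3, 5, 1
cache = PredecodeCache(toy_length)
first, second = walk(cache, LOOP_BODY), walk(cache, LOOP_BODY)
print(first, second, cache.ild_calls)
```

On the second pass no length decode runs at all, which is the "do once, reuse many times" effect.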
Decoder Example: AMD K5
[Figure: 8 bytes (b0 b1 b2 … b7) arrive From Memory and pass through Predecode Logic, which attaches 5 predecode bits to each 8-bit byte (8 bits + 5 bits), giving 13 bytes that are written to the I$ as 8 x (8-bit inst + 5-bit predecode). Decode reads 16 x (8-bit inst + 5-bit predecode) from the I$ and produces up to 4 ROPs.]
Decoder Example: AMD K5
• Predecode information makes decode easier
  – Instruction start/end location (ILD)
  – Number of ROPs needed per inst
  – Opcode and prefix locations
• Power/performance tradeoffs
  – Larger I$ (increase data by 62.5%: 5 extra bits per 8-bit byte)
    • Longer I$ latency, more I$ power consumption
  – Remove logic from decode
    • Shorter branch mispred penalty, simpler logic
    • Cache and reuse decode work → less decode power
  – Longer effective IL1 miss latency
Limits on Decode
• Max branches (color allocation)
• Taken branches
• Incomplete instructions
  – x86 insts are not aligned, may span two cache lines
  – can’t decode until both halves have been fetched
• Instruction complexity
  – decoding “complex” (2-4 uop) instructions requires a more complex decoder; expensive to replicate
  – compromise: fewer complex decoders plus simpler decoders for instructions with single-uop mappings
Decoder Example: Intel P-Pro
[Figure: 16 Raw Instruction Bytes feed Decoder 0, Decoder 1 and Decoder 2, which produce 4 uops, 1 uop and 1 uop; the uROM is attached to Decoder 0. A Branch Address Calculator triggers a fetch resteer if needed.]
If instruction in Decoder 1 or 2 requires > 1 uop, do not generate any output, and then shift to Decoder to the left on next cycle
Only Decoder 0 can interface with the uROM and MS
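The 4-1-1 rule can be sketched as a per-cycle assignment: instructions are offered to the decoders in order, and decoding stops at the first instruction that needs more uops than its decoder can produce. The sketch below simplifies the shifting behavior to "the rejected instruction starts the next cycle in Decoder 0" and does not model the uROM path; the function names are illustrative.

```python
def decode_cycle_411(uop_counts):
    """Number of instructions decoded this cycle under a 4-1-1 template.

    uop_counts lists the uops needed by the next x86 instructions, oldest first.
    """
    limits = [4, 1, 1]                 # Decoder 0, Decoder 1, Decoder 2
    decoded = 0
    for limit, need in zip(limits, uop_counts):
        if need > limit:
            break                      # this and younger instructions wait
        decoded += 1
    return decoded


def cycles_to_decode(uop_counts):
    """Cycles needed to decode the whole sequence."""
    pending, cycles = list(uop_counts), 0
    while pending:
        n = decode_cycle_411(pending)
        if n == 0:
            raise ValueError("instruction needs more than 4 uops (uROM not modeled)")
        pending = pending[n:]
        cycles += 1
    return cycles


for seq in ([4, 1, 1], [1, 2, 1], [2, 2, 2]):
    print(seq, cycles_to_decode(seq))
```

The ordering of simple and complex instructions therefore changes decode throughput even when the total uop count is the same.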
Decoder Example: Intel P4
[Figure: the L2 Cache supplies raw instruction bytes to a 4-uop Decoder (with uROM), which decodes at most one inst per cycle into the trace const. buffer that fills the Trace Cache; up to 3 uops per cycle are fetched from the Trace Cache.]
P4 has a strangled front-end: at best it can only deliver 3 uops per cycle. Contrast this with the P-Pro, which can deliver up to 6 uops per cycle (if they’re 4/1/1).
More on this when we study the P4 in detail
Pipeline Control
[Figure: pipeline stages BPred, I$, I$, ILD, Rot, Dec, Dec, Ren, Disp, grouped into Fetch and Decode.]
• Dispatch: ROB, LSQ, RS full → stall front-end
• Except not everyone stalls…
• This logic starting to get pretty intense
Non-Uniform Stall
Just because there’s a stall condition somewhere does not imply that everybody has to stall
[Figure: pipeline BP, I$, I$, ILD, Rot, Dec, Dec, Ren, Disp with the label “ROB Full!”. Some stages hold nops due to an I$ miss; one stage is marked “not full” and four stages are marked “full”.]
Compressing/Serpentine Pipelines
[Figure: pipeline BP, I$, I$, ILD, Rot, Dec, Dec, Ren, Disp with the labels “ROB Full!” and “1 entry free”.]
Better “flow”, but much more complex since need to track how many insts can advance per stage
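A compressing pipeline lets an instruction move forward whenever the slot ahead of it is empty, even while the end of the pipe is stalled. The sketch below models one instruction per stage, with `None` as a bubble and the last stage blocked; it is a simplification of the per-stage "how many insts can advance" bookkeeping mentioned above, and the function name is illustrative.

```python
def step_compressing(stages):
    """Advance one cycle with the last stage stalled.

    stages[0] is the first pipeline stage, stages[-1] the stalled last stage.
    Each instruction moves at most one stage, and only into a free slot.
    """
    out = list(stages)
    # Walk from the back so that a slot vacated this cycle can be refilled
    # by the instruction directly behind it.
    for i in range(len(out) - 2, -1, -1):
        if out[i] is not None and out[i + 1] is None:
            out[i + 1], out[i] = out[i], None
    return out


pipe = ["A", None, "B", "C", None, "D"]   # D is stuck in the last stage
for cycle in range(3):
    pipe = step_compressing(pipe)
    print(cycle + 1, pipe)
```

Bubbles are squeezed out behind the stall, so when the last stage frees up the instructions behind it are already packed together; a pipeline that stalls as a unit would have left the bubbles in place.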
Lots o’ Stalls
• I$ miss, ITLB miss
• Decoder limitations
  – x86 4-1-1 limit
  – branch limits (max/cycle, max taken/cycle)
• Renamer
  – out of physical registers
Smaller Control Domains
• Separate long pipeline into multiple smaller pipelines
[Figure: one long pipeline (BPred, I$, I$, Dec, Dec, Dec, Ren, Alloc, Sched) split into two shorter pipelines: BPred, I$, I$, Dec, Dec, Dec and Ren, Alloc, Sched.]
Smaller Control Domains (2)
• Non-decoupled pipe needed logic to simultaneously control ~10 stages
• Decoupled pipe needs multiple control logic circuits
  – each only needs to interact with ~5 stages (~3 real stages, plus the queue ahead and behind)
[Figure: Non-decoupled: one Pipeline Control Logic block connects to every stage (I$, Dec, Dec, Dec, Ren). Decoupled: the Pipeline Control Logic connects only to its local stages; links to the other stages are crossed out (X).]
No direct control logic for stages outside of local pipeline
Smaller Control Domains (3)
• Queues can effectively add more pipeline stages
[Figure: an inter-pipe queue between the previous stage and the next stage, made of enqueue logic, a latch and dequeue logic, with a cycle boundary falling inside the queue.]
• Avoid this by writing and reading in the same cycle (affects timing, complexity)
Queues provide Smoothing
• Approximation to serpentine pipes (compress only at certain locations – i.e., the queues)
• Different levels of decoupling possible depending on frequency target, power, complexity tolerance
[Figure: the “SimpleScalar” pipeline: BPred, I$, I$, Dec, Ren, Sched.]
Note: RS is effectively a queue (more later)
Different Clocking Domains
• Decoupling the pipe allows each segment to operate independently (local control)
• Also means each can run at different speeds (P4)
[Figure: P4 pipeline segments TC, TC, Dec, …, Alloc; Sched, …, WB; and Commit, separated by the queues (uopQ), (IAQ) and (ROB). Clock-rate labels on the segments:]
  – 1x freq (3 uops/Mclk)
  – ½x freq (6 uops/2 Mclk’s = 3 uops/Mclk)
  – 2x freq (2 uops/Fclk = 4 uops/Mclk)
  – 1x freq (3 uops/Mclk)
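The per-domain rates only become comparable after converting each to uops per main clock (Mclk): multiply the uops handled per local cycle by the domain's frequency multiplier. A small sketch of that arithmetic using the figures above; the function name is illustrative.

```python
from fractions import Fraction


def uops_per_mclk(uops_per_local_cycle, freq_multiplier):
    """Normalize a domain's throughput to uops per main clock."""
    return Fraction(uops_per_local_cycle) * Fraction(freq_multiplier)


domains = {
    "1x freq, 3 uops/cycle": uops_per_mclk(3, 1),
    "1/2x freq, 6 uops/cycle": uops_per_mclk(6, Fraction(1, 2)),
    "2x freq, 2 uops/cycle": uops_per_mclk(2, 2),
}
for name, rate in domains.items():
    print(f"{name}: {rate} uops/Mclk")
print("sustained limit:", min(domains.values()), "uops/Mclk")
```

The slowest normalized rate bounds the sustained throughput, and the queues between domains absorb the short-term mismatch.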