An Approach for Implementing Efficient Superscalar CISC Processors

Shiliang Hu, Ilhyun Kim, Mikko Lipasti, James Smith

Feb 25, 2016
HPCA 2006, Austin, TX 2
Processor Design Challenges

• CISC challenges: suboptimal internal micro-ops
  – Complex decoders & obsolete features/instructions
  – Instruction count expansion: 40% to 50% (management, communication, ...)
  – Redundancy & inefficiency in the cracked micro-ops
  – Solution: dynamic optimization
• Other current challenges (CISC & RISC)
  – Efficiency: less performance gain per transistor nowadays
  – Power consumption has become acute
  – Solution: novel, efficient microarchitectures
Solution: Architecture Innovations

[Figure: Dynamic translation. Software in the architected ISA (e.g. x86: OS, drivers, library code, apps) is mapped to the implementation ISA (e.g. the fusible ISA) run by the HW implementation (processors, memory system, I/O devices).]
• ISA mapping:
  – Hardware: simple translation, good for startup performance
  – Software: dynamic optimization, good for hotspots
• Can we combine the advantages of both?
  – Startup: fast, simple translation
  – Steady state: intelligent translation/optimization for hotspots
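The startup/steady-state split above is typically driven by execution-frequency profiling. Below is a minimal sketch of threshold-based hotspot promotion; the class name, threshold value, and `translate` callback are illustrative, not taken from this work:

```python
class HotspotProfiler:
    """Counts superblock entries; promotes hot code to the SW translator.

    Cold code runs through the simple hardware decoder (we return None to
    signal that path). Once a block's execution count crosses the threshold,
    it is handed to the dynamic translator/optimizer exactly once, and later
    executions hit the code cache.
    """

    def __init__(self, threshold=50):
        self.threshold = threshold
        self.counts = {}       # entry pc -> execution count
        self.code_cache = {}   # entry pc -> optimized translation

    def on_entry(self, pc, translate):
        if pc in self.code_cache:
            return self.code_cache[pc]           # steady state: optimized code
        n = self.counts.get(pc, 0) + 1
        self.counts[pc] = n
        if n >= self.threshold:
            self.code_cache[pc] = translate(pc)  # hotspot: translate once
            return self.code_cache[pc]
        return None                              # startup: simple HW translation
```

The design point this illustrates: translation cost is paid only for code proven hot, so startup keeps the fast hardware path while steady state gets the optimized software path.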
[Figure: Conventional HW design: decoders feed the pipeline directly. VM paradigm: a software binary translator fills a code cache ("code $") that feeds the pipeline.]
Microarchitecture: Macro-op Execution

• Enhanced OoO superscalar microarchitecture
  – Processes & executes fused macro-ops as single instructions throughout the entire pipeline
  – Analogy: carpooling in all highway lanes reduces congestion while keeping throughput high, AND raises the speed limit from 65 mph to 80 mph

[Figure: Pipeline: Fetch → Align/Fuse → Decode → Rename → Dispatch → Wake-up → Select → RF → EXE (collapsed 3-1 ALUs) → MEM (cache ports) → WB → Retire; a fuse bit marks fused pairs.]
Related Work: x86 Processors

• AMD K7/K8 microarchitecture
  – Macro-operations
  – High-performance, efficient pipeline
• Intel Pentium M
  – Micro-op fusion
  – Stack manager
  – High performance, low power
• Transmeta x86 processors
  – Co-designed x86 VM
  – VLIW engine + code morphing software
Related Work

• Co-designed VM: IBM DAISY, BOA
  – Full-system translator on tree regions + VLIW engine
  – Other research projects: e.g. DBT for ILDP
• Macro-op execution
  – ILDP, Dynamic Strands, Dataflow Mini-graph, CCG
  – Fill Unit, SCISM, rePLay, PARROT
• Dynamic binary translation / optimization
  – SW based (often user mode only): UQBT, Dynamo (RIO), IA-32 EL, Java and .NET HLL VM runtime systems
  – HW based: trace cache fill units, rePLay, PARROT, etc.
Co-designed x86 Processor Architecture

[Figure: x86 code in the memory hierarchy takes one of two paths: a vertical x86 decoder cracks cold code directly, while VM translation/optimization software translates hotspots into a macro-op code cache read through the I-$; a horizontal micro-/macro-op decoder then feeds the rename/dispatch stage, the issue buffer, and the EXE pipeline backend.]

• Co-designed virtual machine paradigm
  – Startup: simple hardware decode/crack for fast translation
  – Steady state: dynamic software translation/optimization for hotspots
Fusible Instruction Set

• RISC-ops with unique features:
  – A fusible bit per instruction for fusing
  – Dense encoding: 16/32-bit ISA
• Special features to support x86:
  – Condition codes
  – Addressing modes
  – Awareness of long immediate values

[Figure: Fusible ISA instruction formats. Core 32-bit formats, each led by the fusible bit F: F + 10b opcode + 21-bit immediate/displacement; F + 10b opcode + 16-bit immediate/displacement + 5b Rds; F + 10b opcode + 11b immediate/disp + 5b Rsrc + 5b Rds; F + 16-bit opcode + 5b Rsrc + 5b Rds. Add-on 16-bit formats for code density: F + 5b opcode + 10b immediate/disp; F + 5b opcode + 5b Rds + 5b immd; F + 5b opcode + 5b Rsrc + 5b Rds.]
Macro-op Fusing Algorithm

• Objectives:
  – Maximize fused dependent pairs
  – Simple & fast
• Heuristics:
  – Pipelined scheduler: only single-cycle ALU ops can be a head; minimize non-fused single-cycle ALU ops
  – Criticality: fuse instructions that are "close" in the original sequence; ALU-op criticality is easier to estimate
  – Simplicity: 2 or fewer distinct register operands per fused pair
• Solution: two-pass fusing algorithm
  – The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, it looks backward in the scan for a head
  – The 2nd pass considers all kinds of RISC-ops as tail candidates
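The two passes and the pairing heuristics above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Op` record, the fixed backward-search `window`, and the function names are invented for the sketch, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Op:
    idx: int              # position in the superblock
    kind: str             # "alu", "ld", "st", "br", ...
    single_cycle: bool    # only single-cycle ALU ops may head a pair
    srcs: tuple           # source register names
    dst: Optional[str]    # destination register (None for stores/branches)

def fuse_superblock(ops, window=4):
    """Two-pass forward scan with backward pairing (illustrative sketch)."""
    head_of = {}   # tail idx -> head idx of each fused pair
    fused = set()  # indices already belonging to some pair

    def try_fuse(tail):
        # Look backward from the tail, nearest ops first (criticality heuristic).
        for head in reversed(ops[max(0, tail.idx - window):tail.idx]):
            if head.idx in fused:
                continue
            if not (head.kind == "alu" and head.single_cycle):
                continue  # only single-cycle ALU ops can be a head
            if head.dst is None or head.dst not in tail.srcs:
                continue  # the tail must consume the head's result
            # Simplicity heuristic: at most 2 distinct external source registers.
            if len(set(head.srcs) | (set(tail.srcs) - {head.dst})) > 2:
                continue
            head_of[tail.idx] = head.idx
            fused.update((head.idx, tail.idx))
            return

    # Pass 1: only ALU ops as tails; pass 2: all remaining RISC-ops as tails.
    for alu_tails_only in (True, False):
        for op in ops:
            if op.idx in fused or (alu_tails_only and op.kind != "alu"):
                continue
            try_fuse(op)
    return head_of
```

On the slide's example, pass 1 would fuse the ADD feeding the AND, and pass 2 would fuse the address-forming ADD with the LD that consumes it.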
Fusing Algorithm: Example

x86 asm:
-----------------------------------------------------
1. lea eax, DS:[edi + 01]
2. mov [DS:080b8658], eax
3. movzx ebx, SS:[ebp + ecx << 1]
4. and eax, 0000007f
5. mov edx, DS:[eax + esi << 0 + 0x7c]

RISC-ops:
-----------------------------------------------------
1. ADD Reax, Redi, 1
2. ST Reax, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND Reax, 0000007f
5. ADD R17, Reax, Resi
6. LD Redx, mem[R17 + 0x7c]

After fusing: macro-ops
-----------------------------------------------------
1. ADD R18, Redi, 1 :: AND Reax, R18, 007f
2. ST R18, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD R17, Reax, Resi :: LD Redx, mem[R17 + 0x7c]
Instruction Fusing Profile

[Figure: Percentage of dynamic instructions per benchmark (0-100%), broken down into Fused, LD, ST, BR, FP or NOPs, and un-fused ALU categories.]

• 55+% of RISC-ops are fused, increasing effective ILP by a factor of 1.4
• Only 6% of single-cycle ALU ops are left un-fused
Processor Pipeline

[Figure: x86 baseline pipeline: Fetch → Align → x86Decode1 → x86Decode2 → x86Decode3 → Rename → Dispatch → Wakeup → Select → Payload → RF → EXE → WB → Retire. Macro-op pipeline: Fetch → Align/Fuse → Decode → Rename → Dispatch → Wakeup → Select → Payload → RF → EXE → WB → Retire, with pipelined 2-cycle issue logic, reduced instruction traffic throughout, a pipelined scheduler, and reduced forwarding.]

• Macro-op pipeline for efficient hotspot execution
  – Executes macro-ops
  – Higher IPC, and higher clock-speed potential
  – Shorter pipeline front-end
Co-designed x86 Pipeline Front-end

[Figure: 16 bytes are fetched per cycle; the align/fuse stage pairs dependent instructions (e.g. ops 1, 2-3, and 4-5 into slots 0-2) ahead of decode, rename, and dispatch, so each fused pair occupies a single slot.]
Co-designed x86 Pipeline Backend

[Figure: A 2-cycle macro-op scheduler (Wakeup → Select) feeds the Payload, RF, EXE, and WB/Mem stages. Each of the three issue ports has a lane with dual-entry slots and 2 register-file read ports; the lanes drive ALU0/ALU1/ALU2 paired with collapsed 3-1 ALUs, plus two memory ports.]
Experimental Evaluation

• x86vm: experimental framework for exploring the co-designed x86 virtual machine paradigm
• Proposed co-designed x86 processor
  – A specific instantiation of the framework
  – Software components: the VMM (DBT, code caches, VM runtime control and resource-management system; some source code extracted from BOCHS 2.2)
  – Hardware components: microarchitecture timing simulators (baseline OoO superscalar, macro-op execution, etc.)
• Benchmarks: SPEC2000 integer
Performance Evaluation: SPEC2000

[Figure: Relative IPC performance (0.5-1.3) vs. issue window size (16-64) for 4-wide/3-wide/2-wide Macro-op and 4-wide/3-wide Base configurations.]
Performance Contributors

• Many factors contribute to the IPC improvement:
  – Code straightening
  – Macro-op fusing and execution
  – Reduced pipeline front-end (reduced branch penalty)
  – Collapsed 3-1 ALUs (branches & addresses resolve sooner)
• Besides the baseline and macro-op models, we model three intermediate configurations:
  – M0: baseline + code cache
  – M1: M0 + macro-op fusing
  – M2: M1 + shorter pipeline front-end (macro-op mode)
  – Macro-op: M2 + collapsed 3-1 ALUs
Performance Contributors: SPEC2000

[Figure: Normalized IPC speedup (%) per benchmark, -10% to 70%, for M0 (base + code cache), M1 (M0 + fusing), M2 (M1 + shorter pipe), and Macro-op (M2 + 3-1 ALU).]
Conclusions

• Architecture enhancement
  – The hardware/software co-designed paradigm enables novel designs & more desirable system features
  – Fusing dependent instruction pairs collapses the dataflow graph to increase ILP
• Complexity effectiveness
  – Pipelined 2-cycle instruction scheduler
  – Significantly reduced ALU value-forwarding network
  – DBT software reduces hardware complexity
• Power consumption implications
  – Reduced pipeline width
  – Reduced inter-instruction communication and instruction management
Finale: Questions & Answers

Suggestions and comments are welcome. Thank you!
Outline

• Motivation & Introduction
• Processor Microarchitecture Details
• Evaluation & Conclusions
Performance Simulation Configuration

                           BASELINE              BASELINE PIPELINED    MACRO-OP
ROB size                   128                   128                   128
Retire width               3,4                   3,4                   2,3,4 MOP
Scheduler pipeline stages  1                     2                     2
Fuse RISC-ops?             No                    No                    Yes
Issue width                3,4                   3,4                   2,3,4 MOP
Issue window size          Variable; sample points from 16 up to 64 (effectively larger for the macro-op mode)
Register file              128 entries; 8,10 read ports and 5,6 write ports (baselines); 6,8,10 read & 6,8,10 write ports (macro-op)
Functional units           4,6,8 INT ALUs, 2 MEM R/W ports, 2 FP ALUs
Cache hierarchy            4-way 32KB L1-I, 4-way 32KB L1-D, 8-way 1MB L2
Cache/memory latency       L1: 2 cycles + 1 cycle AGU; L2: 8 cycles; Mem: 200 cycles for the 1st chunk, 6 cycles between chunks
Fetch width                16 bytes of x86 instructions (baselines); 16B of fusible micro-ops (macro-op)
Fuse Macro-ops: An Illustrative Example

x86 instructions (16 bytes)   Fusible ISA                             Latency
1. cmp ds:[ebx + 02], 0d      LD Rtmp, mem[Rebx + 02]                 3
2. jnz 08115ae1               CMP Rtmp, 0d :: Jz 2f                   1
3. jmp 08115bf2               (direct jmp removed)
4. add esp, 0c                ADD.cc Resp, 0c :: LD Rebp, mem[Resp]   3
5. pop ebp                    ADD Resp, 4 :: LD Rtmp, mem[Resp]       3
6. ret_near                   ADD Resp, 4                             1
                              BR.ret Rtmp                             1

6 x86 instructions crack into 9 RISC-like instructions (20 bytes), fused into 6 macro-ops: 6 issue-queue slots & issues.
Translation Framework

Dynamic binary translation framework:
1. Form a hotspot superblock; crack x86 instructions into RISC-style micro-ops.
2. Perform cluster analysis of embedded long immediate values and assign them to registers if necessary.
3. Generate RISC-ops (IR form) in the implementation ISA.
4. Construct the DDG (Data Dependence Graph) for the superblock.
5. Run the fusing algorithm: scan for dependent pairs to fuse (forward scan, backward pairing; two passes to prioritize ALU ops).
6. Assign registers; reorder fused dependent pairs together; extend live ranges for precise traps; use consistent state mapping at superblock exits.
7. Generate code into the code cache.
Other DBT Software Profile

• Of all fused macro-ops:
  – 50% are ALU-ALU pairs
  – 30% are fused condition-test & conditional-branch pairs
  – The others are mostly ALU-MEM pairs
• Of all fused macro-ops:
  – 70+% fuse across x86 instruction boundaries
  – 46% access two distinct source registers
  – Only 15% (6% of all instruction entities) write two distinct destination registers
• Translation overhead profile:
  – About 1000+ instructions per translated hotspot instruction
Dependence Cycle Detection

[Figure: Candidate head/tail pairings among ops A, B, C, D, with an intermediate op on a path between head and tail; cases (a) and (b) reduce to case (c).]

• All cases are generalized to (c) due to the anti-scan fusing heuristic.
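A fused pair must not close a dependence cycle: if some intermediate op X consumes the head's result and the tail consumes X's result, the fused unit would wait on X while X waits on the unit. A minimal check over the superblock's dataflow graph; the `ddg` successor-map representation (op index → set of consumer indices) is an assumption of this sketch:

```python
def creates_cycle(ddg, head, tail):
    """Would fusing (head, tail) create a macro-op dependence cycle?

    ddg[i] is the set of ops that consume op i's result. Fusing makes head
    and tail one scheduling unit, so any path head -> X -> tail through an
    intermediate op X (X outside the pair) becomes a cycle.
    """
    stack = [x for x in ddg.get(head, ()) if x != tail]  # skip the direct edge
    seen = set()
    while stack:
        x = stack.pop()
        if x == tail:
            return True          # found head -> ... -> X -> tail
        if x in seen:
            continue
        seen.add(x)
        stack.extend(ddg.get(x, ()))
    return False
```

A fuser would call this before committing each candidate pair and skip pairs for which it returns True.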
HST Back-end Profile

• Light-weight opts: ProcLongImm, DDG setup, encode
  – Tens of instructions of overhead each per x86 instruction; includes the initial load from disk
• Heavy-weight opts: uops translation, fusing, codegen
  – None dominates

[Figure: Number of x86 instructions of translation overhead per benchmark (0-1200), broken down into ProcLongImm, xlate_uop, DDGsetup, fuse macro-ops, and codegen.]
Hotspot Coverage vs. Runs

[Figure: Hotspot coverage (%) per benchmark, 0-100%, for 100M-instruction, test, and reference runs.]
Hotspot Detected vs. Runs

[Figure: Translation overhead (instructions translated per million instructions executed, 0-40) for each SPEC2000 integer benchmark (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf), for 100M-instruction, test, and reference runs.]
Performance Evaluation (WSB2004)

[Figure: Relative IPC performance (0.5-1.2) vs. issue buffer size (16-64) for 4-wide/3-wide/2-wide Macro-op and 4-wide/3-wide Base configurations.]
Performance Contributors (WSB2004)

[Figure: Normalized IPC speedup (%) per benchmark, -10% to 40%, for M0 (base + code cache), M1 (M0 + fusing), M2 (M1 + shorter pipe), and Macro-op (M2 + 3-1 ALU).]
Future Directions

• Co-designed virtual machine technology:
  – Confidence: more realistic benchmark studies, important for whole-workload behavior such as hotspot behavior and the impact of context switches
  – Enhancement: more synergistic, complexity-effective HW/SW co-design techniques
  – Application: specific enabling techniques for novel future computer architectures
• Example co-designed x86 processor design:
  – Confidence study as above
  – Enhancement (HW μ-arch): reduce register write ports
  – Enhancement (VMM): more dynamic optimizations in HST, e.g. CSE, a software stack manager, SIMDification