An Approach for Implementing Efficient Superscalar CISC Processors

Shiliang Hu, Ilhyun Kim, Mikko Lipasti, James Smith

Feb 25, 2016
HPCA 2006, Austin, TX 2
Processor Design Challenges

• CISC challenges: suboptimal internal micro-ops
  – Complex decoders & obsolete features/instructions
  – Instruction count expansion: 40% to 50% (management, communication, ...)
  – Redundancy & inefficiency in the cracked micro-ops
  – Solution: dynamic optimization
• Other current challenges (CISC & RISC)
  – Efficiency: less performance gain per transistor nowadays
  – Power consumption has become acute
  – Solution: novel, efficient microarchitectures
Solution: Architecture Innovations

[Figure: Dynamic translation. Software in the architected ISA (e.g. x86: OS, drivers, library code, apps) is mapped to the implementation ISA (e.g. the fusible ISA) run by the HW implementation (processors, memory system, I/O devices).]
• ISA mapping:
  – Hardware: simple translation, good for startup performance
  – Software: dynamic optimization, good for hotspots
• Can we combine the advantages of both?
  – Startup: fast, simple translation
  – Steady state: intelligent translation/optimization for hotspots
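The startup/steady-state split above is typically driven by execution-frequency profiling. Below is a minimal sketch of threshold-based hotspot promotion; the class name, threshold value, and `translate` callback are illustrative, not taken from this work:

```python
class HotspotProfiler:
    """Counts superblock entries; promotes hot code to the SW translator.

    Cold code runs through the simple hardware decoder (we return None to
    signal that path). Once a block's execution count crosses the threshold,
    it is handed to the dynamic translator/optimizer exactly once, and later
    executions hit the code cache.
    """

    def __init__(self, threshold=50):
        self.threshold = threshold
        self.counts = {}       # entry pc -> execution count
        self.code_cache = {}   # entry pc -> optimized translation

    def on_entry(self, pc, translate):
        if pc in self.code_cache:
            return self.code_cache[pc]           # steady state: optimized code
        n = self.counts.get(pc, 0) + 1
        self.counts[pc] = n
        if n >= self.threshold:
            self.code_cache[pc] = translate(pc)  # hotspot: translate once
            return self.code_cache[pc]
        return None                              # startup: simple HW translation
```

The design point this illustrates: translation cost is paid only for code proven hot, so startup keeps the fast hardware path while steady state gets the optimized software path.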
[Figure: Conventional HW design: decoders feed the pipeline directly. VM paradigm: a software binary translator fills a code cache ("code $") that feeds the pipeline.]
Microarchitecture: Macro-op Execution

• Enhanced OoO superscalar microarchitecture
  – Processes & executes fused macro-ops as single instructions throughout the entire pipeline
  – Analogy: carpooling in all highway lanes reduces congestion while keeping throughput high, AND raises the speed limit from 65 mph to 80 mph

[Figure: Pipeline: Fetch → Align/Fuse → Decode → Rename → Dispatch → Wake-up → Select → RF → EXE (collapsed 3-1 ALUs) → MEM (cache ports) → WB → Retire; a fuse bit marks fused pairs.]
Related Work: x86 Processors

• AMD K7/K8 microarchitecture
  – Macro-operations
  – High-performance, efficient pipeline
• Intel Pentium M
  – Micro-op fusion
  – Stack manager
  – High performance, low power
• Transmeta x86 processors
  – Co-designed x86 VM
  – VLIW engine + code morphing software
Related Work

• Co-designed VM: IBM DAISY, BOA
  – Full-system translator on tree regions + VLIW engine
  – Other research projects: e.g. DBT for ILDP
• Macro-op execution
  – ILDP, Dynamic Strands, Dataflow Mini-graph, CCG
  – Fill Unit, SCISM, rePLay, PARROT
• Dynamic binary translation / optimization
  – SW based (often user mode only): UQBT, Dynamo (RIO), IA-32 EL, Java and .NET HLL VM runtime systems
  – HW based: trace cache fill units, rePLay, PARROT, etc.
Co-designed x86 Processor Architecture

[Figure: x86 code in the memory hierarchy takes one of two paths: a vertical x86 decoder cracks cold code directly, while VM translation/optimization software translates hotspots into a macro-op code cache read through the I-$; a horizontal micro-/macro-op decoder then feeds the rename/dispatch stage, the issue buffer, and the EXE pipeline backend.]

• Co-designed virtual machine paradigm
  – Startup: simple hardware decode/crack for fast translation
  – Steady state: dynamic software translation/optimization for hotspots
Fusible Instruction Set

• RISC-ops with unique features:
  – A fusible bit per instruction for fusing
  – Dense encoding: 16/32-bit ISA
• Special features to support x86:
  – Condition codes
  – Addressing modes
  – Awareness of long immediate values

[Figure: Fusible ISA instruction formats. Core 32-bit formats, each led by the fusible bit F: F + 10b opcode + 21-bit immediate/displacement; F + 10b opcode + 16-bit immediate/displacement + 5b Rds; F + 10b opcode + 11b immediate/disp + 5b Rsrc + 5b Rds; F + 16-bit opcode + 5b Rsrc + 5b Rds. Add-on 16-bit formats for code density: F + 5b opcode + 10b immediate/disp; F + 5b opcode + 5b Rds + 5b immd; F + 5b opcode + 5b Rsrc + 5b Rds.]
Macro-op Fusing Algorithm

• Objectives:
  – Maximize fused dependent pairs
  – Simple & fast
• Heuristics:
  – Pipelined scheduler: only single-cycle ALU ops can be a head; minimize non-fused single-cycle ALU ops
  – Criticality: fuse instructions that are "close" in the original sequence; ALU-op criticality is easier to estimate
  – Simplicity: 2 or fewer distinct register operands per fused pair
• Solution: two-pass fusing algorithm
  – The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, it looks backward in the scan for a head
  – The 2nd pass considers all kinds of RISC-ops as tail candidates
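The two passes and the pairing heuristics above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Op` record, the fixed backward-search `window`, and the function names are invented for the sketch, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Op:
    idx: int              # position in the superblock
    kind: str             # "alu", "ld", "st", "br", ...
    single_cycle: bool    # only single-cycle ALU ops may head a pair
    srcs: tuple           # source register names
    dst: Optional[str]    # destination register (None for stores/branches)

def fuse_superblock(ops, window=4):
    """Two-pass forward scan with backward pairing (illustrative sketch)."""
    head_of = {}   # tail idx -> head idx of each fused pair
    fused = set()  # indices already belonging to some pair

    def try_fuse(tail):
        # Look backward from the tail, nearest ops first (criticality heuristic).
        for head in reversed(ops[max(0, tail.idx - window):tail.idx]):
            if head.idx in fused:
                continue
            if not (head.kind == "alu" and head.single_cycle):
                continue  # only single-cycle ALU ops can be a head
            if head.dst is None or head.dst not in tail.srcs:
                continue  # the tail must consume the head's result
            # Simplicity heuristic: at most 2 distinct external source registers.
            if len(set(head.srcs) | (set(tail.srcs) - {head.dst})) > 2:
                continue
            head_of[tail.idx] = head.idx
            fused.update((head.idx, tail.idx))
            return

    # Pass 1: only ALU ops as tails; pass 2: all remaining RISC-ops as tails.
    for alu_tails_only in (True, False):
        for op in ops:
            if op.idx in fused or (alu_tails_only and op.kind != "alu"):
                continue
            try_fuse(op)
    return head_of
```

On the slide's example, pass 1 would fuse the ADD feeding the AND, and pass 2 would fuse the address-forming ADD with the LD that consumes it.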
Fusing Algorithm: Example

x86 asm:
-----------------------------------------------------
1. lea eax, DS:[edi + 01]
2. mov [DS:080b8658], eax
3. movzx ebx, SS:[ebp + ecx << 1]
4. and eax, 0000007f
5. mov edx, DS:[eax + esi << 0 + 0x7c]

RISC-ops:
-----------------------------------------------------
1. ADD Reax, Redi, 1
2. ST Reax, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND Reax, 0000007f
5. ADD R17, Reax, Resi
6. LD Redx, mem[R17 + 0x7c]

After fusing: macro-ops
-----------------------------------------------------
1. ADD R18, Redi, 1 :: AND Reax, R18, 007f
2. ST R18, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD R17, Reax, Resi :: LD Redx, mem[R17 + 0x7c]
Instruction Fusing Profile

[Figure: Percentage of dynamic instructions per benchmark (0-100%), broken down into Fused, LD, ST, BR, FP or NOPs, and un-fused ALU categories.]

• 55+% of RISC-ops are fused, increasing effective ILP by a factor of 1.4
• Only 6% of single-cycle ALU ops are left un-fused
Processor Pipeline

[Figure: x86 baseline pipeline: Fetch → Align → x86Decode1 → x86Decode2 → x86Decode3 → Rename → Dispatch → Wakeup → Select → Payload → RF → EXE → WB → Retire. Macro-op pipeline: Fetch → Align/Fuse → Decode → Rename → Dispatch → Wakeup → Select → Payload → RF → EXE → WB → Retire, with pipelined 2-cycle issue logic, reduced instruction traffic throughout, a pipelined scheduler, and reduced forwarding.]

• Macro-op pipeline for efficient hotspot execution
  – Executes macro-ops
  – Higher IPC, and higher clock-speed potential
  – Shorter pipeline front-end
Co-designed x86 Pipeline Front-end

[Figure: 16 bytes are fetched per cycle; the align/fuse stage pairs dependent instructions (e.g. ops 1, 2-3, and 4-5 into slots 0-2) ahead of decode, rename, and dispatch, so each fused pair occupies a single slot.]
Co-designed x86 Pipeline Backend

[Figure: A 2-cycle macro-op scheduler (Wakeup → Select) feeds the Payload, RF, EXE, and WB/Mem stages. Each of the three issue ports has a lane with dual-entry slots and 2 register-file read ports; the lanes drive ALU0/ALU1/ALU2 paired with collapsed 3-1 ALUs, plus two memory ports.]
Experimental Evaluation

• x86vm: experimental framework for exploring the co-designed x86 virtual machine paradigm
• Proposed co-designed x86 processor
  – A specific instantiation of the framework
  – Software components: the VMM (DBT, code caches, VM runtime control and resource-management system; some source code extracted from BOCHS 2.2)
  – Hardware components: microarchitecture timing simulators (baseline OoO superscalar, macro-op execution, etc.)
• Benchmarks: SPEC2000 integer
Performance Evaluation: SPEC2000

[Figure: Relative IPC performance (0.5-1.3) vs. issue window size (16-64) for 4-wide/3-wide/2-wide Macro-op and 4-wide/3-wide Base configurations.]
Performance Contributors

• Many factors contribute to the IPC improvement:
  – Code straightening
  – Macro-op fusing and execution
  – Reduced pipeline front-end (reduced branch penalty)
  – Collapsed 3-1 ALUs (branches & addresses resolve sooner)
• Besides the baseline and macro-op models, we model three intermediate configurations:
  – M0: baseline + code cache
  – M1: M0 + macro-op fusing
  – M2: M1 + shorter pipeline front-end (macro-op mode)
  – Macro-op: M2 + collapsed 3-1 ALUs
Performance Contributors: SPEC2000

[Figure: Normalized IPC speedup (%) per benchmark, -10% to 70%, for M0 (base + code cache), M1 (M0 + fusing), M2 (M1 + shorter pipe), and Macro-op (M2 + 3-1 ALU).]
Conclusions

• Architecture enhancement
  – The hardware/software co-designed paradigm enables novel designs & more desirable system features
  – Fusing dependent instruction pairs collapses the dataflow graph to increase ILP
• Complexity effectiveness
  – Pipelined 2-cycle instruction scheduler
  – Significantly reduced ALU value-forwarding network
  – DBT software reduces hardware complexity
• Power consumption implications
  – Reduced pipeline width
  – Reduced inter-instruction communication and instruction management
Finale: Questions & Answers

Suggestions and comments are welcome. Thank you!
Outline

• Motivation & Introduction
• Processor Microarchitecture Details
• Evaluation & Conclusions
Performance Simulation Configuration

                           BASELINE              BASELINE PIPELINED    MACRO-OP
ROB size                   128                   128                   128
Retire width               3,4                   3,4                   2,3,4 MOP
Scheduler pipeline stages  1                     2                     2
Fuse RISC-ops?             No                    No                    Yes
Issue width                3,4                   3,4                   2,3,4 MOP
Issue window size          Variable; sample points from 16 up to 64 (effectively larger for the macro-op mode)
Register file              128 entries; 8,10 read ports and 5,6 write ports (baselines); 6,8,10 read & 6,8,10 write ports (macro-op)
Functional units           4,6,8 INT ALUs, 2 MEM R/W ports, 2 FP ALUs
Cache hierarchy            4-way 32KB L1-I, 4-way 32KB L1-D, 8-way 1MB L2
Cache/memory latency       L1: 2 cycles + 1 cycle AGU; L2: 8 cycles; Mem: 200 cycles for the 1st chunk, 6 cycles between chunks
Fetch width                16 bytes of x86 instructions (baselines); 16B of fusible micro-ops (macro-op)
Fuse Macro-ops: An Illustrative Example

x86 instructions (16 bytes)   Fusible ISA                             Latency
1. cmp ds:[ebx + 02], 0d      LD Rtmp, mem[Rebx + 02]                 3
2. jnz 08115ae1               CMP Rtmp, 0d :: Jz 2f                   1
3. jmp 08115bf2               (direct jmp removed)
4. add esp, 0c                ADD.cc Resp, 0c :: LD Rebp, mem[Resp]   3
5. pop ebp                    ADD Resp, 4 :: LD Rtmp, mem[Resp]       3
6. ret_near                   ADD Resp, 4                             1
                              BR.ret Rtmp                             1

6 x86 instructions crack into 9 RISC-like instructions (20 bytes), fused into 6 macro-ops: 6 issue-queue slots & issues.
Translation Framework

Dynamic binary translation framework:
1. Form a hotspot superblock; crack x86 instructions into RISC-style micro-ops.
2. Perform cluster analysis of embedded long immediate values and assign them to registers if necessary.
3. Generate RISC-ops (IR form) in the implementation ISA.
4. Construct the DDG (Data Dependence Graph) for the superblock.
5. Run the fusing algorithm: scan for dependent pairs to fuse (forward scan, backward pairing; two passes to prioritize ALU ops).
6. Assign registers; reorder fused dependent pairs together; extend live ranges for precise traps; use consistent state mapping at superblock exits.
7. Generate code into the code cache.
Other DBT Software Profile

• Of all fused macro-ops:
  – 50% are ALU-ALU pairs
  – 30% are fused condition-test & conditional-branch pairs
  – The others are mostly ALU-MEM pairs
• Of all fused macro-ops:
  – 70+% fuse across x86 instruction boundaries
  – 46% access two distinct source registers
  – Only 15% (6% of all instruction entities) write two distinct destination registers
• Translation overhead profile:
  – About 1000+ instructions per translated hotspot instruction
Dependence Cycle Detection

[Figure: Candidate head/tail pairings among ops A, B, C, D, with an intermediate op on a path between head and tail; cases (a) and (b) reduce to case (c).]

• All cases are generalized to (c) due to the anti-scan fusing heuristic.
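A fused pair must not close a dependence cycle: if some intermediate op X consumes the head's result and the tail consumes X's result, the fused unit would wait on X while X waits on the unit. A minimal check over the superblock's dataflow graph; the `ddg` successor-map representation (op index → set of consumer indices) is an assumption of this sketch:

```python
def creates_cycle(ddg, head, tail):
    """Would fusing (head, tail) create a macro-op dependence cycle?

    ddg[i] is the set of ops that consume op i's result. Fusing makes head
    and tail one scheduling unit, so any path head -> X -> tail through an
    intermediate op X (X outside the pair) becomes a cycle.
    """
    stack = [x for x in ddg.get(head, ()) if x != tail]  # skip the direct edge
    seen = set()
    while stack:
        x = stack.pop()
        if x == tail:
            return True          # found head -> ... -> X -> tail
        if x in seen:
            continue
        seen.add(x)
        stack.extend(ddg.get(x, ()))
    return False
```

A fuser would call this before committing each candidate pair and skip pairs for which it returns True.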
HST Back-end Profile

• Light-weight opts: ProcLongImm, DDG setup, encode
  – Tens of instructions of overhead each per x86 instruction; includes the initial load from disk
• Heavy-weight opts: uops translation, fusing, codegen
  – None dominates

[Figure: Number of x86 instructions of translation overhead per benchmark (0-1200), broken down into ProcLongImm, xlate_uop, DDGsetup, fuse macro-ops, and codegen.]
Hotspot Coverage vs. Runs

[Figure: Hotspot coverage (%) per benchmark, 0-100%, for 100M-instruction, test, and reference runs.]
Hotspot Detected vs. Runs

[Figure: Translation overhead (instructions translated per million instructions executed, 0-40) for each SPEC2000 integer benchmark (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf), for 100M-instruction, test, and reference runs.]
Performance Evaluation (WSB2004)

[Figure: Relative IPC performance (0.5-1.2) vs. issue buffer size (16-64) for 4-wide/3-wide/2-wide Macro-op and 4-wide/3-wide Base configurations.]
Performance Contributors (WSB2004)

[Figure: Normalized IPC speedup (%) per benchmark, -10% to 40%, for M0 (base + code cache), M1 (M0 + fusing), M2 (M1 + shorter pipe), and Macro-op (M2 + 3-1 ALU).]
Future Directions

• Co-designed virtual machine technology:
  – Confidence: more realistic benchmark studies, important for whole-workload behavior such as hotspot behavior and the impact of context switches
  – Enhancement: more synergistic, complexity-effective HW/SW co-design techniques
  – Application: specific enabling techniques for novel future computer architectures
• Example co-designed x86 processor design:
  – Confidence study as above
  – Enhancement (HW μ-arch): reduce register write ports
  – Enhancement (VMM): more dynamic optimizations in HST, e.g. CSE, a software stack manager, SIMDification