Page 1: Automatic Processor Specialisation using Ad-hoc Functional Units. Paolo.Ienne@epfl.ch, Laura.Pozzi@epfl.ch, Miljan.Vuletic@epfl.ch

Automatic Processor Specialisation using Ad-hoc Functional Units

Paolo.Ienne@epfl.ch, Laura.Pozzi@epfl.ch, Miljan.Vuletic@epfl.ch

EPFL – I&C – LAP


© Ienne 2002, Automatic Processor Specialisation

Classic Options for Systems-on-Chip

Design Gap!


Processor Specialisation: Get the Best of Both Options

Embedded!


VLIW Processor Specialisation

Two complementary specialisation strategies:

  Parametric Architecture

  Ad-hoc Functional Units (AFUs)


Automatically Collapsing Clusters of Instructions into New Ones

If the ad-hoc functional unit completes the job faster: GAIN

One ad-hoc complex operation instead of a long sequence of standard ones


General Goal

Automatically achieve processor specialisation through high-level application code analysis


Outline

Introduction

Motivational example

Goals

Opportunities for specialisation

Challenges, further opportunities,…


Elementary Motivational Example: An Important Kernel…

/* init */
a <<= 8;
/* loop */
for (i = 0; i < 8; i++) {
    if (a & 0x8000) {
        a = (a << 1) + b;
    } else {
        a <<= 1;
    }
}
return a & 0xffff;

Shift-and-add unsigned 8 x 8-bit multiplication
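As a minimal sketch, the kernel above can be wrapped in a complete C function so that it can be compiled and checked against the built-in multiply. The function name and the choice of `uint32_t` for the working variable are assumptions made for illustration; the slide only shows the loop body.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add unsigned 8 x 8-bit multiplication (hypothetical wrapper).
 * The multiplier starts in bits 15..8 of `a`; each iteration tests its
 * current top bit (bit 15), shifts left, and conditionally adds `b`. */
uint32_t mul8x8_shift_add(uint8_t multiplier, uint8_t multiplicand)
{
    uint32_t a = multiplier;
    uint32_t b = multiplicand;
    int i;

    a <<= 8;                        /* init */
    for (i = 0; i < 8; i++) {       /* loop */
        if (a & 0x8000) {
            a = (a << 1) + b;
        } else {
            a <<= 1;
        }
    }
    return a & 0xffff;              /* the 16-bit product */
}
```

Calling `mul8x8_shift_add(13, 11)` should give the same value as `13 * 11`.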


Software Predication

/* init */
a <<= 8;
/* loop */
for (i = 0; i < 8; i++) {
    p1 = - ((a & 0x8000) >> 15);
    a = (a << 1) + (b & p1);
}
return a & 0xffff;

Predicate mask (0 or –1 = 0xffffffff)

Shift, Predicated Add
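A hedged, compilable sketch of the predicated loop follows. The function name and the unsigned types are assumptions. The parentheses around `b & p1` are written out because `+` binds tighter than `&` in C, and the mask is built with unsigned negation so that it is well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free variant of the shift-and-add multiply (hypothetical wrapper).
 * p1 is the predicate mask: all zeros or all ones. */
uint32_t mul8x8_predicated(uint8_t multiplier, uint8_t multiplicand)
{
    uint32_t a = multiplier;
    uint32_t b = multiplicand;
    uint32_t p1;
    int i;

    a <<= 8;
    for (i = 0; i < 8; i++) {
        p1 = 0u - ((a & 0x8000u) >> 15);  /* 0 or 0xffffffff */
        a = (a << 1) + (b & p1);          /* shift, then predicated add */
    }
    return a & 0xffff;
}
```

The result should match the branching version for every pair of 8-bit operands.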


Loop Kernel DAG

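[Figure: DAG of the loop kernel. One branch computes the predicate (a & 0x8000, >> 15, negate) and ANDs it with b; the other computes a << 1; a final + produces the new a. In SW: ~6 cycles. In HW: AND gates, only wiring, and an ALU: 1-2 cycles!]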


Ad-hoc Unit To Accelerate Shift-and-Add Multiplication Loop

[Figure: register file connected to the ALU, LD/ST and MSTEP units]

MSTEP: if (Rn[31] == 1) then Rn ← (Rn << 1) + Rm else Rn ← (Rn << 1)

1 ad-hoc instruction added

Loop kernel reduced to 15-30%
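The MSTEP semantics above can be modelled in C as a plain function. This is only a behavioural sketch: the function names are hypothetical, and aligning the multiplier to bit 31 (a shift by 24) is an assumption made so that the bit-31 test plays the role of the bit-15 test in the software kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Behavioural model of the ad-hoc MSTEP instruction:
 * if Rn[31] is set, Rn <- (Rn << 1) + Rm, else Rn <- (Rn << 1). */
uint32_t mstep(uint32_t rn, uint32_t rm)
{
    if (rn >> 31) {
        return (rn << 1) + rm;
    }
    return rn << 1;
}

/* 8 x 8-bit multiply as eight MSTEPs (assumed register alignment). */
uint32_t mul8x8_mstep(uint8_t multiplier, uint8_t multiplicand)
{
    uint32_t rn = (uint32_t)multiplier << 24;  /* multiplier in bits 31..24 */
    int i;

    for (i = 0; i < 8; i++) {
        rn = mstep(rn, multiplicand);
    }
    return rn;  /* multiplier bits have been shifted out; product remains */
}
```

Each loop iteration is now a single operation, which is the source of the reduction quoted on the slide.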


Loop Unrolling

/* init */
a <<= 8;
/* no loop anymore */
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
return a & 0xffff;


Full DAG

[Figure: full DAG of the unrolled multiplication. Eight chained stages, each made of & 0x8000, >> 15, negate, &, << 1 and +, operate on a and b. The arithmetic optimiser reduces them to an &-network followed by column compression and a final adder. In SW: ~50 cycles. In HW: ~3-4 cycles.]


Ad-hoc Unit To Accelerate Multiplication?! Yeah, a MUL…

[Figure: register file connected to the ALU, LD/ST and MUL units]

Rn ← (Rn & 0x0000.ffff) x (Rm & 0x0000.ffff)

1 ad-hoc instruction added

Function reduced by a factor of 10-15


Classic “Ad-hoc” Customisation…

Altera Nios:

Can we do more of this, really ad-hoc?!


Mainstream SoC/FPGA Processors and Specialisation?

All the recent embedded processors offer some sort of specialisation:

Arbitrary functional units or tightly coupled coprocessors (IFX Carmel 20xx, ARM, Tensilica Xtensa, Altera Nios, etc.)

Parametric resources (STM Lx, ARC Cores, Tensilica Xtensa, Altera Nios, etc.)

But all assume an onerous manual study and design!


Summary of Gain Potentials in Ad-hoc FUs

Exploit data parallelism in hardware

Exploit constants for logic simplification

Some operations reduce to wires in hardware

Exploit arithmetic properties for efficient chaining of arithmetic operations (e.g., carry save)
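The carry-save remark can be illustrated with a small software model. This is a generic textbook sketch, not the authors' optimiser: a 3:2 compressor turns three addends into a sum word and a carry word using only bitwise logic, so a chain of additions needs just one carry-propagating adder at the end. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t sum;    /* bitwise sum without carry propagation */
    uint32_t carry;  /* carries, already shifted to their weight */
} csa_t;

/* 3:2 carry-save compressor: x + y + z == sum + carry (mod 2^32). */
csa_t carry_save_add(uint32_t x, uint32_t y, uint32_t z)
{
    csa_t r;
    r.sum = x ^ y ^ z;
    r.carry = ((x & y) | (x & z) | (y & z)) << 1;
    return r;
}

/* Four-operand addition with a single carry-propagating add at the end. */
uint32_t add4_carry_save(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    csa_t s1 = carry_save_add(a, b, c);
    csa_t s2 = carry_save_add(s1.sum, s1.carry, d);
    return s2.sum + s2.carry;   /* the only full adder in the chain */
}
```

In hardware each compressor costs roughly one full-adder delay regardless of word width, which is why chained additions can be much faster than a sequence of ALU operations.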


Goals

How much scope for AFU specialisation in typical multimedia code?

Are classic ILP techniques or other optimisations (e.g., arithmetic) important to increase the speedup? To what extent?

What are the microarchitectural needs to exploit the potential well?

  Memory ports in the AFUs?

  Number of inputs from the register file? Are two enough?

  Number of outputs to the register file? Is one enough?


Related Work in Reconfigurable Computing

Most of the work is in reconfigurable computing; typically, experiments are tied to a given microarchitecture:

  CHIMAERA [Ye et al., 2000] has the richest measurements, but only for 1-output AFUs and with no AFU-memory interface

  Similarly, PRISC [Razdan et al., 1994] and ConCISe [Kastrup et al., 1999] use clustering approaches for 2-input, 1-output AFUs

  GARP [Hauser et al., 1997] concentrates on the mapping of control flow (hyperblocks in loops) in a loosely coupled architecture (coprocessor)

First, investigate where the potentials are; then fix the microarchitecture


Related Work in AFU Identification

Other authors concentrate on identification methods (“what is the best function for an AFU?”), often with some microarchitectural assumptions:

  MaxMISOs [Alippi et al., 1999] are 1-output candidates of maximal size

  [Jacome et al., 2000] introduce vertical and horizontal aggregation as heuristic methods to cluster operations (no comparisons with other techniques)

  [Arnold et al., 2001] use library pattern-matching techniques with a dynamic pattern library (instruction clusters), but with very limited cluster complexity (3 instructions) in the experiments

  ASIP synthesis: a different problem (minimal covering)

First, investigate where the potentials are; then develop appropriate identification algorithms


Methodology

Concentrate on Data Flow

  Easier to capture automatically (no architecturally visible state in the AFUs)

  Constant latency (variable latency would hardly fit into a statically scheduled environment, e.g., VLIW)

Measurements on Basic Blocks

  Represent the upper limit of the potential advantages

  Upper limit is reachable if microarchitectural constraints are satisfied (e.g., no. of inputs and outputs)


Experimental Flow


Software Execution: Approximate RISC Model

One clock cycle assumed for most SUIF nodes, representing the usage of the execution stage

  Exceptions: e.g., type casts (zero), divisions (N)

  All forwarding paths assumed to exist

  No data/instruction cache, or perfect hit rates, assumed

  Jumps accounted for by adding a fixed amount to the cycle count of each basic block

[Figure: pipeline diagram of instructions 1 to 5 passing through the IF, ID, EX and WB stages; one instruction has a multi-cycle execution (EX1, EX2, EX3)]
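The software cost model described on this slide can be sketched as a few lines of C. The operation kinds, function names and the two cycle parameters are hypothetical: the slide only states one cycle for most nodes, zero for type casts, N for divisions, and a fixed jump overhead per basic block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of SUIF-like nodes for the cost model. */
typedef enum { OP_SIMPLE, OP_CAST, OP_DIV } op_kind_t;

/* Cycles for one node: casts are free, divisions cost div_cycles (the
 * slide's "N"), everything else occupies the execution stage once. */
unsigned op_cycles(op_kind_t kind, unsigned div_cycles)
{
    switch (kind) {
    case OP_CAST: return 0;
    case OP_DIV:  return div_cycles;
    default:      return 1;
    }
}

/* Software cycle count of a basic block: the sum over its nodes plus a
 * fixed per-block amount for jumps. */
unsigned basic_block_cycles(const op_kind_t *ops, size_t n_ops,
                            unsigned div_cycles, unsigned jump_cycles)
{
    unsigned total = jump_cycles;
    size_t i;

    for (i = 0; i < n_ops; i++) {
        total += op_cycles(ops[i], div_cycles);
    }
    return total;
}
```

With perfect forwarding and no cache misses assumed, no stall terms appear in the sum.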


Hardware Execution: Synthesis-based Model

Operator                              Precision                      Relative Delay (hw)
Multiply-Accumulator                  32 bits x 32 bits + 64 bits    1.00
Adder                                 4 bits + 4 bits                0.11
Adder                                 8 bits + 8 bits                0.12
Adder                                 16 bits + 16 bits              0.20
Adder                                 24 bits + 24 bits              0.24
Adder                                 32 bits + 32 bits              0.25
Divider                               4 bits / 4 bits                0.38
Divider                               8 bits / 8 bits                1.22
Divider                               16 bits / 16 bits              3.68
Divider                               24 bits / 24 bits              6.33
Divider                               32 bits / 32 bits              9.61
Divider (by power of two)             any / any                      0.00
Barrel shifter                        8 bits                         0.08
Barrel shifter                        16 bits                        0.11
Barrel shifter                        32 bits                        0.16
Barrel shifter (by constant amount)   any                            0.00
Bitwise multiplexer                   any                            0.02

CMOS 0.18µ, Synopsys Design Compiler + DesignWare
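One plausible use of these relative delays is to estimate the hardware latency of a data-flow graph as its critical path. The sketch below is an assumption about how such a model could be evaluated, not the authors' tool: nodes are given in topological order, each with a delay taken from the table and up to two predecessors, and the data structure and function names are hypothetical.

```c
#include <assert.h>

#define MAX_HW_NODES 64

typedef struct {
    double delay;   /* relative delay of the operator, from the table */
    int pred[2];    /* indices of predecessor nodes, -1 if unused */
} hw_node_t;

/* Longest path through a DAG whose nodes are listed predecessors-first.
 * Returns a negative value if the graph is too large for this sketch. */
double critical_path(const hw_node_t *nodes, int n)
{
    double arrival[MAX_HW_NODES];
    double longest = 0.0;
    int i, k;

    if (n > MAX_HW_NODES) {
        return -1.0;
    }
    for (i = 0; i < n; i++) {
        double start = 0.0;
        for (k = 0; k < 2; k++) {
            int p = nodes[i].pred[k];
            if (p >= 0 && p < i && arrival[p] > start) {
                start = arrival[p];
            }
        }
        arrival[i] = start + nodes[i].delay;
        if (arrival[i] > longest) {
            longest = arrival[i];
        }
    }
    return longest;
}
```

For example, a constant shift (0.00) feeding a 32-bit adder (0.25), whose result is added to a multiplexer output (0.02) by a second 32-bit adder (0.25), gives a critical path of about 0.50 multiply-accumulate delays.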


Partitioning of DFG: Mix of Hardware and Software

AFU memory bandwidth issue

On-AFU (Hardware) and Off-AFU (Software) instructions

DFG partitioned in HW and SW layers

High Cost! Low Performance?


Example of Layering Hybrid DFGs

Hardware and software layers


Metrics and Measurements

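Topological basic block information: inputs, outputs, etc.

Saved cycles (and hence speedup):

$$\sum_{i \in \mathit{all\_ops}} \mathit{Lat}_{SW}(i) - \mathit{CP}_{HW}$$

$$\sum_{i \in \mathit{all\_ops}} \mathit{Lat}_{SW}(i) - \left( \sum_{i \in \mathit{all\_AFU\_layers}} \mathit{CP}_{HW}(i) + \sum_{i \in \mathit{all\_SW\_ops}} \mathit{Lat}_{SW}(i) \right)$$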
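A minimal sketch of how the two saved-cycle expressions on this slide might be evaluated. The reading is an interpretation of the slide: the first form subtracts a single hardware critical path from the total software latency, and the second subtracts the critical paths of all AFU layers plus the latencies of the operations left in software. Function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static double sum_of(const double *values, size_t n)
{
    double total = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        total += values[i];
    }
    return total;
}

/* Saved cycles when the whole basic block is one hardware critical path:
 * sum of Lat_SW over all operations, minus CP_HW. */
double saved_cycles_hw(const double *lat_sw_all, size_t n_all, double cp_hw)
{
    return sum_of(lat_sw_all, n_all) - cp_hw;
}

/* Saved cycles for a layered hardware/software partition:
 * sum of Lat_SW over all operations, minus the critical paths of the
 * AFU layers, minus Lat_SW of the operations that stay in software. */
double saved_cycles_hybrid(const double *lat_sw_all, size_t n_all,
                           const double *cp_hw_layers, size_t n_layers,
                           const double *lat_sw_rest, size_t n_rest)
{
    return sum_of(lat_sw_all, n_all)
         - sum_of(cp_hw_layers, n_layers)
         - sum_of(lat_sw_rest, n_rest);
}
```

Whether the hardware critical path is rounded up to whole cycles before subtracting is not recoverable from the transcript, so no rounding is applied here.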


Basic Blocks Characteristics: Examples

Benchmark     BB #   Weight   In   Out   Ld    St   Tsw   Thw    nhw   Thyb
adpcmdecode   5      22.84%   3    2     1     0    9     2.07   2     3
              22     17.77%   4    3     1     1    7     1.25   1     3
              9      12.69%   2    3     0     0    5     0.33   1     1
              4      7.61%    1    3     1     0    6     1.00   2     3
mpeg2decode   4      37.44%   5    2     2     0    13    2.49   2     4
              10     34.56%   4    2     2     0    12    2.49   2     4
pegwit        1      31.47%   2    0     296   36   811   3.65   3     335
              25     9.06%    5    2     0     0    7     0.83   1     1
              28     6.47%    2    0     2     1    5     2.29   2     5
              9      6.45%    2    1     2     0    5     2.82   3     5
              13     6.45%    2    0     2     1    5     2.29   2     5
              10     5.16%    4    2     0     0    4     0.83   1     1

Column groups on the slide: Topology, Parallel Memory Access, Sequential Memory Access

Callouts: Execution concentrated in few BBs; Few Ld/St…; Small delays; …well separated; High RF pressure


Basic Blocks Characteristics

Moderate hardware resources for AFUs: often, half of the execution time is concentrated in no more than 2-3 basic blocks

Pressure on the register file higher than classically supported

Limited importance of memory ports, except in some dramatic cases…

Small delay of typical basic blocks


Potential Basic Speedup: Examples

                            Cycle Savings
Benchmark     BB #   ILP     SeqMem   NoConst   Basic    PlusBitwidth   PlusArith
adpcmdecode   5      1.80    12.71%   12.71%    12.71%   14.82%         14.82%
              22     3.50    8.47%    10.59%    10.59%   10.59%         10.59%
              9      1.67    8.47%    8.47%     8.47%    8.47%          8.47%
              4      1.20    3.18%    4.24%     5.29%    5.29%          5.29%
mpeg2decode   4      2.17    24.18%   24.18%    26.87%   -              -
              10     2.40    21.50%   21.50%    24.18%   -              -
pegwit        1      81.10   16.72%   28.31%    28.34%   -              -
              25     1.75    7.03%    5.86%     7.03%    -              -
              28     1.25    0.00%    1.17%     2.34%    -              -
              9      1.00    0.00%    0.00%     2.33%    -              -
              13     1.25    0.00%    1.17%     2.33%    -              -
              10     1.33    3.50%    3.50%     3.50%    -              -

Callouts: Good speedup with few BBs; Not critical…; BBs too simple to bring advantage


Inputs and Outputs of Basic Blocks

[Figure: two charts, speedup per # inputs and speedup per # outputs, with callouts >60% and ~50%]


Potential Basic Speedup

Limited available parallelism

  Top-ranking basic blocks: 10 to 50% cycle savings

Hardwired constants not a key advantage

  Small price for a reduction in design risk

Sequentialisation penalty not dramatic

  AFU memory ports not essential

Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage

  Basic blocks are too simple, ceiling effects,…


Effects of ILP Techniques: Examples

Cycle savings, with total speedup in parentheses:

Benchmark            Basic     PlusPredication   PlusUnrolling   PlusBitwidthAnalysis   PlusArithmetic Opt.
adpcmdecode (par.)   49.45%    88.11%            88.11%          88.11%                 88.11%
                     (2.0 x)   (8.4 x)           (8.4 x)         (8.4 x)                (8.4 x)
adpcmdecode (seq.)   45.21%    77.51%            77.51%          81.75%                 81.75%
                     (1.8 x)   (4.5 x)           (4.5 x)         (5.5 x)                (5.5 x)
mpeg2decode (par.)   68.23%    68.23%            86.91%          -                      87.92%
                     (3.1 x)   (3.1 x)           (7.6 x)         -                      (8.3 x)
mpeg2decode (seq.)   60.09%    60.09%            73.73%          -                      74.40%
                     (2.5 x)   (2.5 x)           (3.8 x)         -                      (3.9 x)
pegwit (par.)        63.33%    67.31%            67.31%          -                      -
                     (2.7 x)   (3.1 x)           (3.1 x)         -                      -
pegwit (seq.)        38.99%    42.96%            42.96%          -                      -
                     (1.6 x)   (1.7 x)           (1.7 x)         -                      -

[Slide annotation: each cell is also marked with the number of basic blocks needed to reach 30% speedup; the per-cell values are not recoverable from this transcript]
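The percentages and the bracketed factors in this table are consistent with the usual relation speedup = 1 / (1 - savings), where savings is the fraction of cycles removed. A small sketch, with a hypothetical function name, makes the conversion explicit:

```c
#include <assert.h>

/* Convert a cycle-savings fraction (0 <= s < 1) into a speedup factor.
 * Returns 0.0 for inputs outside that range. */
double speedup_from_savings(double savings_fraction)
{
    if (savings_fraction < 0.0 || savings_fraction >= 1.0) {
        return 0.0;
    }
    return 1.0 / (1.0 - savings_fraction);
}
```

For instance, 49.45% savings corresponds to roughly 2.0x and 88.11% to roughly 8.4x, matching the adpcmdecode (par.) row.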


Effects of ILP Techniques

Major improvements:

  Cumulative speedups between 1.7x and 6.3x

Register file pressure not significantly modified

Hardware complexity and Thw increased

  Area is typically below 2-3x that of a 32-bit multiplier, almost never >10x

Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage

  Baseline advantage already very large


Arithmetic Optimisation Impact

mpeg2decode basic block #7 Tsw Thw

bb

Without arithmetic transformations 55 5 25,344,000 30.6%

With arithmetic transformations 55 3 26,357,760 31.4%

w/o optimisation with optimisation


Conclusions

DFG-level opens potential speedups (2–3x) at low cost (hardware and toolset) and low risk

Larger number of AFU write ports (2-3) needed

Hardcoding of constants not essential

AFU memory interfaces also not essential

ILP techniques help, as expected

Sophisticated and detailed techniques (bitwidth analysis, arithmetic optimisations) sometimes masked by other effects


Ongoing Work

Measure advantages through a complete toolchain (notably, compiler):

  DSP microarchitecture: validate the simple model; find out bottlenecks and impose real DSP constraints (e.g., non-orthogonality)

  VLIW microarchitecture: go beyond the simple software execution model

Develop novel speedup-driven identification algorithms

How to get more AFU specialisation potentials?

  Dynamic identification and configuration of AFUs


Typical Identification Algorithms

Bottom-up greedy approaches to cluster instructions

Topologically driven rather than speedup driven

E.g., MaxMISO identification [Alippi et al., 1999]:

[Figure: a data-flow graph of + and * operations partitioned into three clusters, MS1, MS2 and MS3]
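To make the bottom-up, topology-driven idea concrete, here is a hedged sketch in the spirit of single-output clustering. It is not the published MaxMISO algorithm: starting from a root operation, it repeatedly absorbs any producer whose consumers all lie inside the cluster, so that only the root's value leaves it. The adjacency-matrix graph type and all names are hypothetical, and operations whose results are live outside the graph are not modelled.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 32

typedef struct {
    int n;                                      /* number of operations */
    unsigned char edge[MAX_NODES][MAX_NODES];   /* edge[u][v]: u feeds v */
} dfg_t;

/* Grow a single-output cluster backwards from `root`.
 * in_cluster must have room for MAX_NODES flags.
 * Returns the number of operations in the cluster. */
int grow_single_output_cluster(const dfg_t *g, int root,
                               unsigned char *in_cluster)
{
    int size = 1, changed = 1, u, v;

    memset(in_cluster, 0, MAX_NODES);
    in_cluster[root] = 1;
    while (changed) {
        changed = 0;
        for (u = 0; u < g->n; u++) {
            int feeds_cluster = 0, escapes = 0;
            if (in_cluster[u]) {
                continue;
            }
            for (v = 0; v < g->n; v++) {
                if (!g->edge[u][v]) {
                    continue;
                }
                if (in_cluster[v]) {
                    feeds_cluster = 1;
                } else {
                    escapes = 1;    /* value also needed elsewhere */
                }
            }
            if (feeds_cluster && !escapes) {
                in_cluster[u] = 1;
                size++;
                changed = 1;
            }
        }
    }
    return size;
}
```

Note that the growth rule looks only at graph topology; nothing in it asks whether the absorbed operations actually gain speed in hardware.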


Speedup-driven Identification

Prune out an optimal set of low-speedup nodes to achieve the required input/output count

[Figure: data-flow graph with inputs i0 to i7, a constant k and outputs o0 and o1; nodes carry numeric weights between 0.1 and 3]

SIMD-like and unconnected graphs
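Any pruning of this kind has to re-check the input/output count of the candidate after each removal. The sketch below shows only that check, under stated assumptions: the graph type and names are hypothetical, an input is an outside node feeding the cluster, and an output is a cluster node whose value is used outside the cluster or has no consumer at all (treated here as a result of the block).

```c
#include <assert.h>

#define MAX_NODES 32

typedef struct {
    int n;                                      /* number of nodes */
    unsigned char edge[MAX_NODES][MAX_NODES];   /* edge[u][v]: u feeds v */
} dfg_t;

/* Number of distinct outside nodes that feed the cluster. */
int count_cluster_inputs(const dfg_t *g, const unsigned char *in_cluster)
{
    int u, v, count = 0;

    for (u = 0; u < g->n; u++) {
        if (in_cluster[u]) {
            continue;
        }
        for (v = 0; v < g->n; v++) {
            if (g->edge[u][v] && in_cluster[v]) {
                count++;
                break;
            }
        }
    }
    return count;
}

/* Number of cluster nodes whose value leaves the cluster. */
int count_cluster_outputs(const dfg_t *g, const unsigned char *in_cluster)
{
    int u, v, count = 0;

    for (u = 0; u < g->n; u++) {
        int has_consumer = 0, used_outside = 0;
        if (!in_cluster[u]) {
            continue;
        }
        for (v = 0; v < g->n; v++) {
            if (!g->edge[u][v]) {
                continue;
            }
            has_consumer = 1;
            if (!in_cluster[v]) {
                used_outside = 1;
            }
        }
        if (used_outside || !has_consumer) {
            count++;
        }
    }
    return count;
}

/* Does the cluster respect the register-file port constraints? */
int cluster_fits(const dfg_t *g, const unsigned char *in_cluster,
                 int max_inputs, int max_outputs)
{
    return count_cluster_inputs(g, in_cluster) <= max_inputs
        && count_cluster_outputs(g, in_cluster) <= max_outputs;
}
```

A speedup-driven search would combine this test with per-node gain estimates to decide which nodes to drop.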


Open Issues and Perspectives

Power consumption advantages?

  Power down because: fewer instruction fetches and decodes; fewer register reads and writebacks

  Power possibly up because: reduced correlation of signals in the AFU; low efficiency of the implementation (in case of eFPGAs)

More opportunities to increase speedup?

  Detect and implement LUTs (e.g., in quantisers) as discrete CAMs

  Detect runtime constant values


Dynamic Specialisation?

[Figure: Java Bytecode, JiT + Specialisation, ARM + RFU]

Dynamic compilation and optimisation together with hardware specialisation

  DAISY, Crusoe, JiT, etc.

  Specialisation may profit from runtime information

  Identification in runtime conditions

  Dynamic reconfigurability challenge


Conclusions

Processor customisation opportunities are here: soft cores, FPGA processors, etc.

Very specific field of hardware/software codesign with a very large potential

  Do not give up versatility

  Get most of the performance of custom hardware

Needs automation, to complement compilers and synthesizers (some work exists but limited in scope)