Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions

Ajay K. Verma, Philip Brisk and Paolo Ienne

Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA)

Ecole Polytechnique Fédérale de Lausanne (EPFL)




Custom ISE Identification

[Figure: processor datapath with a register file, ALU, MUL, and LD/ST units, data memory, and an AFU computing out1 = F (in1, in2, in3, in4) and out2 = G (in1, in2, in3, in4).]

Limited number of I/O ports


Outline

Problem formulation
  ISE selection
  I/O serialisation

Related work

Non-optimality of earlier work

Integer Linear Programming (ILP) formulation

Results

Conclusions


Problem Formulation

Given
  a dataflow graph
  a set of forbidden nodes

Find a subgraph S, which is
  convex
  free of forbidden nodes

And has the largest gain
  M (S) = Nexec * (SW (S) – HW (S))

[Figure: example dataflow graph with nodes a, b, c, d, e, f, g, h and x1, x2, x3.]


Convex Subgraph

[Figure: dataflow graph with nodes a, b, c, and d illustrating a non-convex selection.]

In order to execute the AFU, we need the output of node b

Computation of node b requires the output of the AFU

A non-convex AFU cannot be scheduled without creating a deadlock
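The deadlock argument can be checked mechanically: a selection S is convex when no path leaves S and later re-enters it. The following is a minimal Python sketch, assuming the graph is given as a map from each node to the nodes it feeds. The function name `is_convex` and the four-node example graph are illustrative assumptions, not taken from the slides.

```python
def is_convex(succ, selected):
    """Return True if no path leaves `selected` and later re-enters it."""
    selected = set(selected)
    # Start from every node outside S that is fed directly by a node in S.
    stack = [v for u in selected for v in succ.get(u, ()) if v not in selected]
    seen = set()
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        for v in succ.get(u, ()):
            if v in selected:
                return False  # path S -> outside -> S: the AFU would deadlock
            stack.append(v)
    return True


# Hypothetical graph: a feeds b and c, and both b and c feed d.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
non_convex = is_convex(succ, {"a", "c", "d"})   # b is left out but sits between a and d
convex = is_convex(succ, {"a", "b", "c", "d"})  # whole graph selected
print(non_convex, convex)
```

In the first call, node b is outside the selection but consumes a value from it and feeds a value back into it, which is the situation the slide describes.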


I/O Serialisation

[Figure: I/O serialisation example with nodes b, c, d, e, and f.]

2 inputs, 4 outputs

Available I/O ports: (1, 2)


ISE Merit Estimation

M (S) = Nexec * (SW (S) – HW (S))
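As a worked instance of the merit formula, the sketch below evaluates M (S) for one candidate subgraph. It assumes SW (S) is the sum of the per-node software latencies, as the objective-function slide states, and takes HW (S) as a given cycle count. The function name `merit`, the latencies, and the execution count are hypothetical values chosen for illustration.

```python
def merit(n_exec, sw_latency, hw_cycles, selected):
    """M(S) = Nexec * (SW(S) - HW(S)), with SW(S) summed over the nodes in S."""
    sw_cycles = sum(sw_latency[n] for n in selected)
    return n_exec * (sw_cycles - hw_cycles)


# Hypothetical per-node software latencies, in cycles.
sw_latency = {"b": 1, "c": 1, "d": 2, "e": 1, "f": 1}
gain = merit(n_exec=100, sw_latency=sw_latency, hw_cycles=2,
             selected=["b", "c", "d", "e", "f"])
print(gain)  # 100 * (6 - 2) = 400
```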

[Figure: example dataflow graph with nodes a to h and x1 to x3, alongside a subgraph over nodes b, c, d, e, and f.]


Related Work

ISE identification under I/O constraints
  Search space pruning using I/O and convexity constraints [Atasu03, Clark03, Yu04, Pozzi06, Yu07, Chen07]
  ILP-based approach [Atasu05]
  Pseudo-polynomial time algorithm [Bonzini07]

ISE identification under relaxed I/O constraints
  Restricted search space exploration [Pozzi05]
  Generation of a semi-compact set of connected ISEs [Pothineni07]

I/O serialisation
  Exponential time algorithms [Pozzi05, Pothineni07]

Algorithms for specific processor models
  Single-issue RISC processor model [Verma07]


Earlier Work

ISE Selection: Atasu03, Yu07, Chen07, Bonzini07
  Optimal ISE selection under various I/O constraints

I/O Serialisation: Pozzi05, Pothineni07
  Exponential time I/O serialisation algorithm


Non-Optimality of Earlier Work

[Figure: example dataflow graph shown twice, with node delays of .2, .3, .5, and .6, and the annotations "cycle saved: 23.36", "cycle saved: 15.02", "cycle saved: 066", and "cycle saved: 112".]


Our Contributions

Optimal ILP formulation for a large class of processor models
  Earlier work considers the RISC processor model only

Single run
  In earlier work, ISE selection was done for various I/O constraints

ISE selection and I/O scheduling together
  Another source of non-optimality of earlier work


Integer Linear Programming

Objective function

Linear constraints
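For reference, an integer linear program in a common standard form maximises a linear objective subject to linear inequalities over integer variables. The symbols c, A, b, and x below are generic placeholders, not notation from the slides.

```latex
\begin{aligned}
\text{maximise}\quad   & c^{\mathsf{T}} x \\
\text{subject to}\quad & A x \le b, \\
                       & x \in \mathbb{Z}^{n}.
\end{aligned}
```

Boolean decision variables, such as node-selection flags, are the special case where each entry of x is additionally bounded between 0 and 1.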


ILP Formulation

Linear constraints
  No forbidden nodes
  Convexity constraints
  I/O serialisation based constraints
  I/O access per cycle based constraints

Objective function
  Saving in cycles should be maximum


ISE Selection Constraints (1 of 2)

Variable: For each node ni a Boolean variable xi
  xi is true iff node ni is in the selected ISE

Constraint: No forbidden node should be in the ISE
  If ni is a forbidden node, then xi = 0

Variable: For each node ni two Boolean variables pi and si
  pi (si) is true iff at least one predecessor (successor) of ni is in the selected ISE

Constraint: The subgraph corresponding to the selected ISE must be convex
  If pi and si are both true, then xi must be true (i.e., pi + si – xi ≤ 1)
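The linear convexity constraint can be evaluated directly on a small graph. The sketch below assumes pi and si are read transitively (some ancestor, respectively some descendant, of ni is selected) and checks pi + si – xi ≤ 1 for every node. The helper names and the four-node graph are hypothetical.

```python
def reach(adj, start):
    """All nodes reachable from `start` by following `adj` (start excluded)."""
    seen, stack = set(), list(adj[start])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(adj[u])
    return seen


def convexity_holds(succ, x):
    """Check p_i + s_i - x_i <= 1 for every node of the graph."""
    pred = {n: [] for n in succ}          # invert the successor map
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    for n in succ:
        p = int(any(x[a] for a in reach(pred, n)))  # some ancestor selected
        s = int(any(x[d] for d in reach(succ, n)))  # some descendant selected
        if p + s - x[n] > 1:
            return False
    return True


# Hypothetical graph: a feeds b and c, and both b and c feed d.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
violated = convexity_holds(succ, {"a": 1, "b": 0, "c": 1, "d": 1})
satisfied = convexity_holds(succ, {"a": 1, "b": 1, "c": 1, "d": 1})
print(violated, satisfied)
```

In the first selection node b has pi = si = 1 but xi = 0, so the left-hand side is 2 and the constraint rejects the selection.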


ISE Selection Constraints (2 of 2)

Relationship between pi, si and xi

pi = 0, if ni has no children
pi = ∨ (xj ∨ pj) over all children nj of ni, otherwise

si = 0, if ni has no parents
si = ∨ (xj ∨ sj) over all parents nj of ni, otherwise
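The recursive relation can be evaluated in one pass over a topological order: each flag is the OR, over a node's direct neighbours nj, of xj or the neighbour's own flag. Which neighbours count as children or parents depends on how the slide's graph is drawn, so this sketch takes the neighbour map as a parameter. The function name `propagate` and the example graph are assumptions; `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def propagate(neighbours, x):
    """flag[i] = OR over direct neighbours j of (x[j] OR flag[j]); 0 if none."""
    flag = {}
    # static_order() yields each node only after all nodes listed for it.
    for n in TopologicalSorter(neighbours).static_order():
        flag[n] = int(any(x[j] or flag[j] for j in neighbours.get(n, ())))
    return flag


# Hypothetical graph: a feeds b and c, and both b and c feed d.
feeders = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
consumers = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
x = {"a": 0, "b": 0, "c": 1, "d": 0}  # only c is selected

upstream = propagate(feeders, x)      # 1 where a selected node lies upstream
downstream = propagate(consumers, x)  # 1 where a selected node lies downstream
print(upstream, downstream)
```

Passing the map of feeding nodes gives one flag and passing the map of consuming nodes gives the other, so the same routine serves both relations.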


I/O Serialisation Based Constraints (1 of 3)

[Figure: example dataflow graph with nodes n1 to n5.]

Variable: An integer variable intDelayi
  Denotes the cycle in which node ni is executed, e.g.,
  intDelay1 = 0
  intDelay4 = 1
  intDelay5 = 2

Variable: A real variable fractionalDelayi
  Denotes the smallest time after cycle intDelayi at which the output of ni is available, e.g.,
  fractionalDelay3 = HW (n3)
  fractionalDelay4 = HW (n3) + HW (n4)

Variable: An integer variable ρij
  Denotes the number of stages across the edge between nodes ni and nj, e.g.,
  ρ13 = 1
  ρ34 = 0
  ρ25 = 2



I/O Serialisation Based Constraints (2 of 3)

Constraint: The difference between the cycles of a predecessor node and a successor node is the same as the number of latches on the edge connecting them, e.g.,
  intDelay4 = intDelay3 + ρ34
  intDelay5 = intDelay2 + ρ25

Constraint: The total number of stages is the same as the last cycle in which an output node is computed, e.g.,
  R = intDelay5 + ρ57
  R = intDelay2 + ρ26

[Figure: example dataflow graph with nodes n1 to n5 and sink nodes n6 and n7.]

Extra latches on output edges are created in order to realize an imaginary sink node
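The latch constraint fixes every intDelay once the ρ values and the starting cycles are known. The sketch below propagates intDelayj = intDelayi + ρij along the edges and rejects assignments where two edges into the same node disagree. The values ρ13, ρ34, and ρ25 come from the slides; the edge n4 to n5 with ρ45 = 1, the choice of n1 and n2 as cycle-0 sources, and the function name are assumptions made for illustration.

```python
def assign_cycles(rho, sources):
    """Propagate intDelay[j] = intDelay[i] + rho[(i, j)] from the sources."""
    int_delay = {n: 0 for n in sources}
    pending = dict(rho)
    while pending:
        progressed = False
        for (i, j), stages in list(pending.items()):
            if i in int_delay:
                cycle = int_delay[i] + stages
                # setdefault stores the cycle on first visit, else returns the old one.
                if int_delay.setdefault(j, cycle) != cycle:
                    raise ValueError(f"edges into {j} disagree on its cycle")
                del pending[(i, j)]
                progressed = True
        if not progressed:
            raise ValueError("some edges are unreachable from the sources")
    return int_delay


rho = {
    ("n1", "n3"): 1,
    ("n3", "n4"): 0,
    ("n2", "n5"): 2,
    ("n4", "n5"): 1,  # hypothetical edge, not given on the slides
}
cycles = assign_cycles(rho, sources=["n1", "n2"])
print(cycles)
```

With these inputs the propagation reproduces the slide's intDelay4 = 1 and intDelay5 = 2.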



I/O Serialisation Based Constraints (3 of 3)

Constraint: The fractionalDelay of a node depends on the fractionalDelay of its predecessor nodes, e.g.,
  Case 1: if the node is the first node in the cycle
    fractionalDelay3 = HW (n3)
  Case 2: if the node is not the first node in the cycle
    fractionalDelay4 = fractionalDelay3 + HW (n4)

Constraint: The fractionalDelay of a node should never exceed the cycle time, e.g.,
  fractionalDelay3 ≤ λ
  fractionalDelay4 ≤ λ
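The two cases can be folded into a single rule: a node starts at time 0 when every incoming edge carries a latch (ρ > 0), and otherwise starts when its latest same-cycle predecessor finishes. The sketch below applies that rule and checks the cycle-time bound. Taking the maximum over several same-cycle predecessors is an assumption that generalises the slide's single-predecessor examples; the hardware delays, the edge n4 to n5, and the value of λ are hypothetical.

```python
def fractional_delays(order, preds, rho, hw):
    """fractionalDelay per node; `order` lists each node after its predecessors."""
    frac = {}
    for n in order:
        # Predecessors reached without a latch share the node's cycle.
        same_cycle = [frac[p] for p in preds.get(n, ()) if rho[(p, n)] == 0]
        frac[n] = max(same_cycle, default=0.0) + hw[n]
    return frac


preds = {"n3": ["n1"], "n4": ["n3"], "n5": ["n2", "n4"]}
rho = {("n1", "n3"): 1, ("n3", "n4"): 0, ("n2", "n5"): 2, ("n4", "n5"): 1}
hw = {"n1": 0.2, "n2": 0.3, "n3": 0.5, "n4": 0.4, "n5": 0.6}  # hypothetical delays
cycle_time = 1.0                                               # hypothetical lambda

frac = fractional_delays(["n1", "n2", "n3", "n4", "n5"], preds, rho, hw)
fits = all(delay <= cycle_time for delay in frac.values())
print(frac, fits)
```

Node n3 falls under case 1 (its only incoming edge has ρ13 = 1) and node n4 under case 2 (ρ34 = 0), matching the slide's two examples.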

[Figure: example dataflow graph with nodes n1 to n5 and sink nodes n6 and n7.]


I/O Access Per Cycle Based Constraints

Variable: Boolean variables cikIN and cikOUT
  cikIN is true iff ni is an input of the ISE and is accessed in the kth stage of execution (similarly for cikOUT)

Constraint: In each stage no more than m inputs should be accessed, and no more than n outputs should be written back, i.e., for each k
  ∑ cikIN ≤ m
  ∑ cikOUT ≤ n

cikIN and cikOUT can be computed using the intDelay and fractionalDelay of nodes and the ρ values of incoming and outgoing edges of the AFU
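Once each input and output has been assigned a stage, the per-stage port limits reduce to counting. The sketch below assumes the stage assignments are already known, whereas the slides derive them inside the ILP. The function name, node names, and stage numbers are hypothetical.

```python
from collections import Counter


def ports_respected(in_stage, out_stage, m, n):
    """True if no stage reads more than m inputs or writes more than n outputs."""
    reads = Counter(in_stage.values())    # inputs accessed per stage
    writes = Counter(out_stage.values())  # outputs written back per stage
    return (all(count <= m for count in reads.values())
            and all(count <= n for count in writes.values()))


# Hypothetical schedule: one read port (m = 1) and two write ports (n = 2).
serialised = ports_respected({"b": 0, "c": 1}, {"d": 2, "e": 2, "f": 3}, m=1, n=2)
clashing = ports_respected({"b": 0, "c": 0}, {"d": 2, "e": 2, "f": 3}, m=1, n=2)
print(serialised, clashing)
```

The second schedule reads both inputs in stage 0 through a single read port, so it violates the input bound.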


Objective Function

Saving in cycles should be maximized
  SW (S) – HW (S) should be maximum
  SW (S) = ∑ xi SW (ni)
  HW (S) = R

Any processor model where SW (S) and HW (S) can be computed using linear inequalities can be handled using ILP


Experimental Setup

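[Figure: experimental flow. The input dataflow graph is processed by three flows: ISE selection (Atasu03) followed by I/O serialisation (Pozzi05); ISE selection (Atasu03) with no serialisation; and the ILP method. Labels in the figure: "exp / subopt", "exp / opt".]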


Results (1 of 3)

[Figure: results for the viterbi, adpcmdecoder, and adpcmcoder benchmarks, comparing no pipelining, Pozzi’s algorithm, and the ILP method.]



Results (2 of 3)

Pozzi’s algorithm takes several hours on this benchmark and produces inferior results

Benchmark: aes

Biggest dataflow graph: 703

[Figure: results after 3 minutes and after an hour.]



The best AFU with 22 inputs and 22 outputs


Conclusions

ISE Selection: Atasu03, Yu07, Chen07, Bonzini07

I/O Serialisation: Pozzi05, Pothineni07

The methodology can be generalized for a large class of processor models

Optimal, single-run algorithm

