Ajay K. Verma, Philip Brisk and Paolo Ienne
Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA)
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions
2
Custom ISE Identification
[Figure: processor datapath with Register File, ALU, MUL, LD/ST, Data Memory and an AFU]
AFU:
  out1 = F (in1, in2, in3, in4)
  out2 = G (in1, in2, in3, in4)
Limited number of I/O ports
3
Outline
Problem formulation
  ISE selection
  I/O serialisation
Related work
Non-optimality of earlier work
Integer Linear Programming (ILP) formulation
Results
Conclusions
4
Problem Formulation
Given
  a dataflow graph
  a set of forbidden nodes
Find a subgraph S which is
  convex
  free of forbidden nodes
and has the largest gain
  M (S) = Nexec * (SW (S) – HW (S))
[Figure: example dataflow graph with nodes a, b, c, d, e, f, g, h and x1, x2, x3]
5
Convex Subgraph
[Figure: dataflow graph with nodes a, b, c and d]
In order to execute the AFU we need the output of node b
Computation of node b requires the output of the AFU
A non-convex AFU cannot be scheduled without creating a deadlock
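The convexity condition above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it assumes the dataflow graph is stored as a dictionary of successor lists, and the function names `reachable_from` and `is_convex` are hypothetical. The example edges (a to b, a to c, b to d, c to d) are also an assumption, since the figure is not part of the transcript.

```python
def reachable_from(graph, sources):
    """Return every node reachable from `sources` by one or more edges."""
    seen = set()
    stack = [succ for src in sources for succ in graph.get(src, [])]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def is_convex(graph, subgraph):
    """True if no path leaves `subgraph` and later re-enters it."""
    subgraph = set(subgraph)
    # Nodes outside S that consume a value produced inside S.
    outside_successors = {
        succ
        for node in subgraph
        for succ in graph.get(node, [])
        if succ not in subgraph
    }
    # S is non-convex if anything downstream of those nodes lies in S again.
    return not (reachable_from(graph, outside_successors) & subgraph)


# Assumed edge structure for a four-node example (hypothetical).
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(is_convex(graph, {"a", "c", "d"}))  # False: b needs a, and d needs b
print(is_convex(graph, {"a", "b", "c"}))  # True
```

In the first call, node b sits between two nodes of the candidate AFU, which is the deadlock situation the slide describes.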
6
I/O Serialisation
[Figure: subgraph with nodes b, c, d, e and f]
2 inputs, 4 outputs
Available I/O ports: (1, 2)
[Figure: second drawing of nodes b, c, d, e and f]
7
ISE Merit Estimation
M (S) = Nexec * (SW (S) – HW (S))
[Figure: example dataflow graph with nodes a, b, c, d, e, f, g, h and x1, x2, x3, followed by a second drawing of nodes b, c, d, e and f]
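As a worked illustration of the merit formula, the sketch below evaluates M (S) for one candidate subgraph. It is only an example under stated assumptions: SW (S) is taken as the sum of per-node software latencies (the form given on the Objective Function slide), HW (S) is passed in as a cycle count, and the function name `merit` and all latency values are hypothetical.

```python
def merit(n_exec, sw_latency, hw_cycles, subgraph):
    """M(S) = Nexec * (SW(S) - HW(S)) for one candidate subgraph S."""
    # SW(S): cycles the selected nodes take when executed in software.
    sw_cycles = sum(sw_latency[node] for node in subgraph)
    return n_exec * (sw_cycles - hw_cycles)


# Hypothetical software latencies (in cycles) for the selected nodes.
sw_latency = {"b": 1, "c": 1, "d": 2, "e": 1, "f": 1}
gain = merit(n_exec=1000, sw_latency=sw_latency, hw_cycles=2,
             subgraph={"b", "c", "d", "e", "f"})
print(gain)  # 4000: six software cycles become two, executed 1000 times
```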
8
Related Work
ISE identification under I/O constraints
  Search space pruning using I/O and convexity constraints [Atasu03, Clark03, Yu04, Pozzi06, Yu07, Chen07]
  ILP-based approach [Atasu05]
  Pseudo-polynomial time algorithm [Bonzini07]
ISE identification under relaxed I/O constraints
  Restricted search space exploration [Pozzi05]
  Generation of a semi-compact set of connected ISEs [Pothineni07]
I/O serialisation
  Exponential time algorithms [Pozzi05, Pothineni07]
Algorithms for specific processor models
  Single-issue RISC processor model [Verma07]
Variable: An integer variable ρ_ij denotes the number of stages across the edge between the nodes n_i and n_j, e.g.,
  ρ_13 = 1
  ρ_34 = 0
  ρ_25 = 2
17
I/O Serialisation Based Constraints (2 of 3)
Constraint: The difference between the cycles of a predecessor and a successor node is the same as the number of latches on the edge connecting them, e.g.,
  intDelay_4 = intDelay_3 + ρ_34
  intDelay_5 = intDelay_2 + ρ_25
Constraint: The total number of stages is the same as the last cycle in which an output node is computed, e.g.,
  R = intDelay_5 + ρ_57
  R = intDelay_2 + ρ_26
[Figure: example graph with nodes n1 to n7]
Extra latches on output edges are created in order to realize an imaginary sink node
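The two constraints on this slide can be read as a consistency check on a candidate pipelining. The sketch below is an illustration only, not the ILP itself: the function name `stage_constraints_hold` is hypothetical, edges are stored as (i, j) tuples, and the intDelay values and the latch counts on the output edges are made-up numbers chosen to agree with ρ_13 = 1, ρ_34 = 0 and ρ_25 = 2 from the slides.

```python
def stage_constraints_hold(int_delay, rho, output_rho, total_stages):
    """Check intDelay_j = intDelay_i + rho_ij and R = intDelay_i + rho_i,sink."""
    # Internal edges: the successor's cycle equals the predecessor's cycle
    # plus the number of latches on the connecting edge.
    internal_ok = all(
        int_delay[j] == int_delay[i] + latches
        for (i, j), latches in rho.items()
    )
    # Output edges: every output must reach the imaginary sink in cycle R.
    outputs_ok = all(
        total_stages == int_delay[i] + latches
        for (i, _sink), latches in output_rho.items()
    )
    return internal_ok and outputs_ok


rho = {(1, 3): 1, (3, 4): 0, (2, 5): 2}       # values from the slides
output_rho = {(5, 7): 1, (2, 6): 3}           # hypothetical latch counts
int_delay = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}    # hypothetical cycle numbers
print(stage_constraints_hold(int_delay, rho, output_rho, total_stages=3))  # True
```

In an ILP these equalities would be stated over integer variables and left to the solver; here they are only verified for fixed values.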
18
I/O Serialisation Based Constraints (3 of 3)
Constraint: fractionalDelay of a node depends on the fractionalDelay of its predecessor nodes, e.g.,
  Case 1: if the node is the first node in the cycle
    fractionalDelay_3 = HW (n_3)
  Case 2: if the node is not the first node in the cycle
    fractionalDelay_4 = fractionalDelay_3 + HW (n_4)
Constraint: fractionalDelay of a node should never exceed the cycle time, e.g.,
  fractionalDelay_3 ≤ λ
  fractionalDelay_4 ≤ λ
[Figure: example graph with nodes n1 to n7]
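The two cases above can be sketched as a single propagation over the graph. This is a hedged illustration rather than the authors' formulation: it assumes that a node with several same-stage predecessors takes the largest of their fractional delays, that an edge with ρ = 0 means "same stage", and that nodes are visited in topological order. The function names and all HW delay values are hypothetical.

```python
def fractional_delays(order, preds, rho, hw):
    """Propagate fractionalDelay through nodes given in topological order."""
    frac = {}
    for node in order:
        # Only predecessors in the same stage (no latch on the edge) add delay.
        same_stage = [
            frac[p] for p in preds.get(node, []) if rho[(p, node)] == 0
        ]
        # Case 1 (first in its cycle): no same-stage predecessor, so HW(n) only.
        # Case 2: add HW(n) to the slowest same-stage predecessor (assumption).
        frac[node] = max(same_stage, default=0.0) + hw[node]
    return frac


def fits_cycle_time(frac, cycle_time):
    """True if no node's fractionalDelay exceeds the cycle time lambda."""
    return all(delay <= cycle_time for delay in frac.values())


preds = {3: [1], 4: [3]}
rho = {(1, 3): 1, (3, 4): 0}          # rho values from the slides
hw = {1: 0.5, 3: 0.25, 4: 0.5}        # hypothetical delays, in units of lambda
frac = fractional_delays([1, 3, 4], preds, rho, hw)
print(frac[3], frac[4])               # 0.25 0.75
print(fits_cycle_time(frac, 1.0))     # True
```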
19
I/O Access Per Cycle Based Constraints
Variable: Boolean variables c_ik^IN and c_ik^OUT
  c_ik^IN is true iff n_i is an input of the ISE and is accessed in the k-th stage of execution (similarly for c_ik^OUT)
Constraint: In each stage no more than m inputs should be accessed, and no more than n outputs should be written back, i.e., for each k
  ∑_i c_ik^IN ≤ m
  ∑_i c_ik^OUT ≤ n
c_ik^IN and c_ik^OUT can be computed using the intDelay and fractionalDelay of nodes and the ρ values of incoming and outgoing edges of the AFU
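The per-stage port limits can be illustrated with a small check. In this sketch, which is an assumption-laden example and not the ILP encoding, each input or output node is mapped to the single stage in which it is accessed, so an entry `input_stage[i] == k` plays the role of the Boolean variable for input i in stage k being true. The function name `io_ports_respected`, the node names and the stage numbers are hypothetical.

```python
from collections import Counter


def io_ports_respected(input_stage, output_stage, m, n):
    """True if every stage reads at most m inputs and writes at most n outputs."""
    reads_per_stage = Counter(input_stage.values())
    writes_per_stage = Counter(output_stage.values())
    return (
        all(count <= m for count in reads_per_stage.values())
        and all(count <= n for count in writes_per_stage.values())
    )


input_stage = {"n1": 1, "n2": 2}     # hypothetical: one input read per stage
output_stage = {"n6": 3, "n7": 3}    # hypothetical: both outputs written in stage 3
print(io_ports_respected(input_stage, output_stage, m=1, n=2))  # True
print(io_ports_respected(input_stage, output_stage, m=1, n=1))  # False
```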
20
Objective Function
Saving in cycles should be maximized
  SW (S) – HW (S) should be maximum
  SW (S) = ∑ x_i SW (n_i)
  HW (S) = R
Any processor model where SW (S) and HW (S) can be computed using linear inequalities can be handled using ILP
21
Experimental Setup
[Figure: experimental flow starting from the input dataflow graph. Box labels: ISE selection (Atasu03), shown twice; ILP method; I/O serialisation (Pozzi05); No serialisation. Annotations: exp / subopt; exp / opt]
22
Results (1 of 3)
[Figure: results for the benchmarks viterbi, adpcmdecoder and adpcmcoder, comparing No pipelining, Pozzi’s algorithm and the ILP method]
23
Results (2 of 3)
Pozzi’s algorithm takes several hours on this benchmark, and produces inferior results
Benchmark: aes
Biggest dataflow graph: 703
[Figure: two panels, "After 3 minutes" and "After an hour"]
24
Results (3 of 3)
The best AFU with 22 inputs and 22 outputs
25
Conclusions
ISE Selection: Atasu03, Yu07, Chen07, Bonzini07
I/O Serialisation: Pozzi05, Pothineni07
The methodology can be generalized for a large class of processor models