Low Power High Level Synthesis. 1999. 8. Professor Jun-Dong Cho, Sungkyunkwan University. http://vada.skku.ac.kr. System Partitioning. To decide which components of the system will be realized in hardware and which will be implemented in software - PowerPoint PPT Presentation
Low Power High Level Synthesis, 1999. 8
http://vada.skku.ac.kr
VADA Lab.
System Partitioning
To decide which components of the system will be realized in
hardware and which will be implemented in software
High-quality partitioning is critical in high-level synthesis. To
be useful, high-level synthesis algorithms should be able to handle
very large systems. Typically, designers partition high-level
design specifications manually into procedures, each of which is
then synthesized individually. Different partitionings of the
high-level specifications may produce substantial differences in
the resulting IC chip areas and overall system performance.
To decide whether the system functions are distributed or not.
Distributed processors, memories and controllers can lead to
significant power savings. The drawback is the increase in area.
E.g., a non-distributed and a distributed design of a vector
quantizer.
Circuit Partitioning
We can take node weights into account by letting the weight of a node (i, j) in Nc be the sum of the weights of the nodes i and j. We can similarly take edge weights into account by letting the weight of an edge in Ec be the sum of the weights of the edges "collapsed" into it. Furthermore, we can choose the edge (i, j) which matches j to i in the construction of Nc above to have the largest weight of all edges incident on i; this will tend to minimize the weights of the cut edges. This is called heavy edge matching in METIS, and is illustrated on the right.
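As a sketch (not the METIS implementation itself), the greedy heavy-edge-matching pass described above might look like this in Python; the graph representation (a dict of weighted adjacency dicts) is an assumption for illustration:

```python
def heavy_edge_matching(adj):
    """Greedy heavy-edge matching, METIS-style coarsening sketch.

    adj: dict mapping node -> {neighbor: edge_weight}.
    Returns a dict pairing each matched node with its partner
    (a node with no unmatched neighbor is matched to itself).
    """
    match = {}
    for i in adj:
        if i in match:
            continue
        # pick the unmatched neighbor connected by the heaviest edge
        candidates = [(w, j) for j, w in adj[i].items() if j not in match]
        if candidates:
            _, j = max(candidates)
            match[i] = j
            match[j] = i
        else:
            match[i] = i
    return match
```

Matching i to its heaviest incident edge hides that edge inside a coarse node, so it can never become a cut edge, which is exactly the effect the slide describes.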
Multilevel Kernighan-Lin
Given a partition (Nc+,Nc-) from step (2) of Recursive_partition,
it is easily expanded to a partition (N+,N-) in step (3) by
associating
with each node in Nc+ or Nc- the nodes of N that comprise it. This
is again shown below:
Finally, in step (4) of Recursive_partition, the approximate partition from step (3) is improved using a variation of Kernighan-Lin.
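The expansion in step (3) is purely mechanical; a minimal Python sketch, assuming each coarse node records the fine nodes collapsed into it:

```python
def expand_partition(coarse_part, members):
    """Expand a coarse partition (Nc+, Nc-) to the fine graph (N+, N-).

    coarse_part: dict coarse_node -> 0 or 1 (side of the cut)
    members:     dict coarse_node -> list of fine nodes collapsed into it
    Returns a dict fine_node -> side, ready for Kernighan-Lin refinement.
    """
    fine_part = {}
    for c, side in coarse_part.items():
        for v in members[c]:
            fine_part[v] = side
    return fine_part
```

The result is only approximate at the fine level, which is why step (4) refines it with a Kernighan-Lin variant.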
WRITE VHDL
Minimizing switching activity in registers
Minimizing switching activity in resources and interconnect
LOW POWER AND FAST SCHEDULING
Instructions
Operations
Variables
Arrays
signals
@(posedge clk);
High-Level Synthesis
The allocation task determines the type and quantity of resources
used in the RTL design. It also determines the clocking scheme,
memory hierarchy and pipelining style. To perform the required
trade-offs, the allocation task must determine the exact area and
performance values.
The scheduling task schedules operations and memory references into clock cycles. If the number of clock cycles is a constraint, the scheduler has to produce a design with the fewest functional units.
The binding task assigns operations and memory references within
each clock cycle to available hardware units. A resource can be
shared by different operations if they are mutually exclusive, i.e.
they will never execute simultaneously.
Sibling [Fang, 96]
Correlation resource sharing [Gebotys, 97]
FU shut-down (demand-driven operation) [Alidina, 94], [Rabaey, 96]
Dual [Sarrafzadeh, 96]
Spurious [Hwang, 96]
[Cho, 97]
[Figure: example DFG with additions +1, +2, +3, a multiplication *1, and variables a-h]
Ptotal = PREG + PMUX + PFU + PINT, where PREG is the power of the registers, PMUX the power of the multiplexers, PFU the power of the functional units, and PINT the power of the physical interconnect capacitance.
High-Level Power Estimation: PREG
Compute the lifetimes of all the variables in the given VHDL code.
Represent the lifetime of each variable as a vertical line from statement i through statement i + n in the column j reserved for the corresponding variable vj.
Determine the maximum number N of overlapping lifetimes by computing the maximum number of vertical lines intersecting any horizontal cut-line.
Estimate the minimal number N of register sets necessary to implement the code by using register sharing; register sharing is applied whenever a group of variables with the same bit-width bi have non-overlapping lifetimes.
Select a possible mapping of variables into registers by using register sharing.
Compute the number wi of writes to the variables mapped to the same set of registers. Estimate the switching rate ni of each register set by dividing wi by the number of statements S: ni = wi/S; hence TRi,max = ni · fclk.
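The cut-line step above is a standard interval-overlap sweep; a minimal Python sketch, assuming lifetimes are given as inclusive (birth, death) statement indices:

```python
def min_registers(lifetimes):
    """Minimum registers N = max number of simultaneously live variables.

    lifetimes: list of (birth, death) statement indices, inclusive,
    one per variable (the vertical lines on the slide).
    """
    events = []
    for birth, death in lifetimes:
        events.append((birth, +1))      # lifetime starts
        events.append((death + 1, -1))  # lifetime ends after `death`
    events.sort()
    live = best = 0
    for _, delta in events:
        live += delta
        best = max(best, live)
    return best
```

Sorting the end event (death + 1, -1) before a start at the same statement means back-to-back lifetimes do not count as overlapping, matching the cut-line interpretation.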
Power in latches and flip-flops is consumed not only during output transitions, but also during all clock edges by the internal clock buffers.
The non-switching power PNSK dissipated by the internal clock buffers accounts for about 30% of the average power in a 0.38-micron, 3.3 V technology.
PCNTR
After scheduling, the control is defined and optimized by the hardware mapper, and further by the logic synthesis process before mapping to layout. Like interconnect, therefore, the control needs to be estimated statistically.
Local control model: the local controllers account for a larger percentage of the total capacitance than the global controller.
Where Ntrans is the number of transitions, Nstates the number of states, Clc the capacitance switched in any local controller in one sample period, and Bf the ratio of the number of bus accesses to the number of busses.
Global control model
Ntrans
The number of transitions depends on assignment, scheduling,
optimizations, logic optimization, the standard cell library used,
the amount of glitchings and the statistics of the inputs.
Coarse-grained model
The coarse-grained model provides a fast estimation of the power consumption when no information about the activity of the input data to the functional units is available.
Fine-grained model
The fine-grained model is used when information about the activity of the input data to the functional units is available.
[Figure: power model of an 8 x 8-bit Booth multiplier]
Loop Interchange
If matrix A is laid out in memory in column-major form, execution order (a.2) implies more cache misses than the execution order in (b.2). Thus, the compiler chooses algorithm (b.1) to reduce the running time.
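A hedged illustration of the interchange: Python lists of lists are stored row by row (the mirror of the column-major case on the slide), so here it is the interchanged version that walks along a row in the inner loop. The principle is the same either way: make the innermost loop follow the storage order.

```python
def sum_colwise(A):
    """(a.2)-style order: the inner loop strides down a column,
    jumping between rows -> poor locality for row-major storage."""
    total = 0.0
    for j in range(len(A[0])):
        for i in range(len(A)):
            total += A[i][j]
    return total

def sum_rowwise(A):
    """(b.2)-style order after loop interchange: the inner loop walks
    along a row, matching the storage order -> fewer cache misses."""
    total = 0.0
    for i in range(len(A)):
        for j in range(len(A[0])):
            total += A[i][j]
    return total
```

Both compute the same value; only the memory access pattern, and hence the cache (and power) behavior on real hardware, differs.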
Motion Estimation
Retiming
Flip-flop insertion to minimize hazard activity by moving a flip-flop in a circuit.
Power reduction
By sampling a steady-state signal at a register input, no more glitches are propagated through the following combinational logic.
Common patterns enable the design of less complex architecture and
therefore simpler interconnect structure (muxes, buffers, and
buses). Regular designs often have less control hardware.
Module Selection
Select the clock period, choose proper hardware modules for all
operations(e.g., Wallace or Booth Multiplier), determine where to
pipeline (or where to put registers), such that a minimal hardware
cost is obtained under given timing and throughput
constraints.
Full pipelining: an ineffective clock period results from mismatches between the execution times of the operators; performing operations in sequence without intermediate buffering can result in a reduction of the critical path.
By clustering operations into non-pipelined hardware modules, the reusability of these modules over the complete computational graph can be maximized.
During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints.
Estimate min and max bounds on the required resources to delimit the design space; the min bounds serve as an initial solution and as entries in a resource utilization table which guides the transformation, assignment and scheduling operations.
Max bound on execution time tmax: topological ordering of the DFG using ASAP and ALAP.
Minimum bound on the number of resources for each resource class: NRi = ceil(dRi * ORi / tmax), where NRi is the number of resources of class Ri, dRi the duration of a single operation, and ORi the number of operations.
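Assuming the bound takes the usual form NRi = ceil(dRi · ORi / tmax) (the equation itself did not survive extraction, so this is a reconstruction), it can be computed directly:

```python
import math

def min_resource_bound(d_Ri, O_Ri, t_max):
    """Lower bound on resources of class Ri.

    O_Ri operations, each taking d_Ri time-units, represent
    d_Ri * O_Ri units of work; fitting that work into t_max
    cycles needs at least ceil(d_Ri * O_Ri / t_max) units.
    """
    return math.ceil(d_Ri * O_Ri / t_max)
```

For example, 5 multiplications of 2 cycles each, under a 4-cycle deadline, need at least 3 multipliers.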
Find the minimal-area solution subject to the timing constraints.
By checking the critical paths, it is determined whether the proposed graph violates the timing constraints; if so, retiming, pipelining and tree-height reduction can be applied.
After an acceptable graph is obtained, the resource allocation process is initiated.
Transform the graph to reduce the hardware requirements.
Use a rejectionless probabilistic iterative search technique (a variant of Simulated Annealing), where moves are always accepted. This approach reduces computational complexity and gives faster convergence.
Scheduling and Binding
The scheduling task selects the control step, in which a given
operation will happen, i.e., assign each operation to an execution
cycle
Sharing: Bind a resource to more than one operation.
Operations must not execute concurrently.
Graph scheduled hierarchically in a bottom-up fashion
Power tradeoffs
Schedule directly impacts resource sharing
Energy consumption depends on what the previous instruction was
Reordering to minimize the switching on the control path
Clock selection
Eliminate slacks
ASAP Scheduling
[Figures: mechanical analogy; probability of scheduling operations into control steps, before and after operation o3 is scheduled to step s2; operator cost for multiplications in (a) and (c)]
ready operation list/resource constraint
Priority list
Decompose a computation into strongly connected components
Any adjacent trivial SCCs are merged into a sub part;
Use pipelining to isolate the sub parts;
For each sub part
If (the sub part is linear)
Apply optimal unfolding;
Merge linear sub parts to further optimize;
Schedule merged sub parts to minimize memory usage
SCC decomposition step
SCCs are identified using the standard depth-first-search-based algorithm [Tarjan, 1972], which has low-order polynomial-time complexity. For any pair of operations A and B within an SCC, there exist both a path from A to B and a path from B to A. The graph formed by all the SCCs is acyclic. Thus, the SCCs can be isolated from each other using pipeline delays, which enables us to optimize each SCC separately.
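For reference, a compact Python version of Tarjan's SCC algorithm; the graph encoding as a successor-list dict is an assumption for illustration:

```python
def tarjan_scc(graph):
    """Tarjan's algorithm [1972]: strongly connected components of a
    directed graph given as {node: [successors]}."""
    index, low = {}, {}
    on_stack, stack, sccs = set(), [], []
    counter = [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:          # back edge within current SCC
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:           # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs
```

Because the condensation of the SCCs is acyclic, each returned component can then be isolated with pipeline delays and optimized on its own, as the slide notes.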
Identifying SCCs
The first step of the approach is to identify the computation's strongly connected components.
Pipeline operations inside a loop.
Overlap execution of operations.
Use pipeline scheduling for loop graph model.
DFG Restructuring
Control Synthesis
Optimum binding
Resource sharing can destroy signal correlations and increase switching activity; it should be done between operations that are strongly connected.
Map operations with correlated input signals to the same
units
Regularity: repeated patterns of computation (e.g., (+, *), (*, *), (+, >)) simplify interconnect (busses, multiplexers, buffers).
Datapath interconnections
Multiplexer-oriented datapath
Bus-oriented datapath
Insertion of Latch (out)
Insertion of latches at the output ports of the functional
units
Insertion of Latch (in/out)
Insertion of latches at both the input and output ports of the
functional units
Overlapping data transfer with functional-unit execution
Scheduled DFG
Graph model
Register Allocation
Allocation : bind registers and functional modules to variables and
operations in the CDFG and specify the interconnection among
modules and registers in terms of MUX or BUS.
Reduce capacitance during allocation by minimizing the number of
functional modules, registers, and multiplexers.
Register allocation and variable assignment can have a profound
impact on spurious switching activity in a circuit
Judicious variable assignment is a key factor in the elimination of
spurious operations
Effect of Register Sharing on FU
For a small increase in the number of registers, it is possible to
significantly reduce or eliminate spurious operations
For Design 1, synthesized from Assignment 1, the power consumption was 30.71 mW; for Design 2, synthesized from Assignment 2, it was 18.96 mW (1.2-um standard-cell library).
Operand sharing
To schedule and bind operations to functional units in such a way
that the activity of the input operands is reduced.
Operations sharing the same operand are bound to the same
functional unit and scheduled in such a way that the function unit
can reuse that operand.
We call operand reutilization (OPR) the reuse of an operand by two operations consecutively executed in the same functional unit.
Loop unrolling
The technique of loop unrolling replicates the body of a loop some
number of times (unrolling factor u) and then iterates by step u
instead of step 1. This transformation reduces the loop overhead,
increases the instruction parallelism and improves register, data
cache or TLB locality.
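A minimal sketch of unrolling with factor u = 2, on a hypothetical scaling loop (not an example from the slides); the clean-up iteration handles odd trip counts:

```python
def scale_rolled(a, c):
    """Original loop: one element per iteration."""
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = c * a[i]
    return out

def scale_unrolled_2(a, c):
    """Same loop unrolled with factor u = 2: half the loop-control
    overhead, and the two body statements expose instruction
    parallelism to the scheduler."""
    n = len(a)
    out = [0.0] * n
    i = 0
    while i + 1 < n:
        out[i] = c * a[i]
        out[i + 1] = c * a[i + 1]
        i += 2
    if i < n:            # clean-up iteration when n is odd
        out[i] = c * a[i]
    return out
```

Both versions produce identical results; the benefit of the unrolled form shows up in loop overhead, parallelism and locality, as the next slide enumerates.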
Loop Unrolling Effects
Loop overhead is cut in half because two iterations are performed
in each iteration.
If array elements are assigned to registers, register locality is
improved because A(i) and A(i +1) are used twice in the loop
body.
Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
Loop Unrolling (IIR filter example)
Loop unrolling localizes the data to reduce the activity of the inputs of the functional units; two output samples are computed in parallel based on two input samples.
Neither the capacitance switched nor the voltage is altered.
However, loop unrolling enables several other transformations
(distributivity, constant propagation, and pipelining). After
distributivity and constant propagation,
The transformation yields a critical path of 3, thus the voltage can be dropped.
obtaining a reduction of 9.4%.
Operand Retaining
An idle functional unit may have input operand changes because of
the variation of the selection signals of multiplexers.
The operand-retaining technique attempts to minimize the useless
power consumption of the idle functional units.
Minimize the useless power consumption of idle units
(a) with a proper register binding that minimizes the activity of
the functional units
(b) by wisely defining the control signals of the multiplexors
during the idle cycles in such a way that the changes at the inputs
of the functional units are minimized (this may result in defining
some of the don't care values of the control signals) [RJ94]
(c) latching the operands of those units that will be often
idle.
(c) Latching the operands
This consists of the insertion of latches at the inputs of the functional units to store the operands only when the unit requires them. Thus, in those cycles in which the unit is idle no consumption is produced.
The control unit has to be redesigned accordingly, in such a way
that input latches become transparent during those cycles in which
the corresponding functional unit must execute an operation.
Similar to putting the functional units to sleep when they are not
needed through gated-clocking strategies [CSB94,BSdM94,
AMD94].
LMS filter Example
The power consumption generated by the idle units (useless
consumption) is:
the power consumption because of the useful calculations (useful
consumption) is:
The estimated reduction in power consumption is 34%.
Interconnect power reduction
A spatially local cluster: group of algorithm operations that are
tightly connected to each other in the flowgraph
representation.
Two nodes are tightly connected on the flowgraph representation if the shortest distance between them, in terms of number of edges traversed, is low.
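The tightness measure just defined is a breadth-first shortest path counted in edges; a small Python sketch over a successor-list flowgraph (an assumed encoding):

```python
from collections import deque

def edge_distance(adj, a, b):
    """Shortest path length (edges traversed) from node a to node b
    in a flowgraph {node: [successors]}; 'tightly connected' node
    pairs have a low value. Returns None if b is unreachable."""
    if a == b:
        return 0
    seen = {a}
    frontier = deque([(a, 0)])
    while frontier:
        v, d = frontier.popleft()
        for w in adj.get(v, []):
            if w == b:
                return d + 1
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return None
```

A clustering pass could group nodes whose pairwise distance falls under some threshold, so most data transfers stay on local busses within a cluster.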
A spatially local assignment is a mapping of the algorithm
operations to specific hardware units such that no operations in
different clusters share the same hardware.
Partitioning the algorithm into spatially local clusters ensures
that the majority of the data transfers take place within clusters
(with local bus) and relatively few occur between clusters (with
global bus).
The partitioning information is passed to the architecture netlist
and floorplanning tools.
Local: a given adder outputs data to its own inputs. Global: a given adder outputs data to another adder's inputs.
Hardware Mapping
The last step in the synthesis process maps the allocated, assigned
and scheduled flow graph (called the decorated flow graph) onto the
available hardware blocks.
The result of this process is a structural description of the
processor architecture, (e.g., sdl input to the Lager IV silicon
assembly environment).
The mapping process transforms the flow graph into three structural
sub-graphs:
the data path structure graph
the controller state machine graph
the interface graph (between data path control inputs and the
controller output signals)
Spectral Partitioning in High-Level Synthesis
The eigenvector placement obtained forms an ordering in which nodes
tightly connected to each other are placed close together.
The relative distances are a measure of the tightness of the connections.
Use the eigenvector ordering to generate several partitioning
solutions
The area estimates are based on distribution graphs.
A distribution graph displays the expected number of operations
executed in each time slot.
Local bus power: the number of local data transfers times the area of the cluster.
Global bus power: the number of global data transfers times the total area.
Interconnection Estimation
For connections within a datapath (over-the-cell routing), routing between units increases the actual height of the datapath by approximately 20-30%, and most wire lengths are about 30-40% of the datapath height.
Average global bus length : square root of the estimated chip
area.
The three terms represent white space, active area of the
components, and wiring area. The coefficients are derived
statistically.
Datapath Generation
Register file recognition and the multiplexer reduction:
Individual registers are merged as much as possible into register
files
reduces the number of bus multiplexers, the overall number of
busses (since all registers in a file share the input and output
busses) and the number of control signals (since a register file
uses a local decoder).
Minimize the multiplexers and I/O busses simultaneously (clique partitioning is NP-complete, thus Simulated Annealing is used).
Datapath partitioning optimizes the processor floorplan. The core idea is to grow pairs of isomorphic regions, as large as possible, from corresponding pairs of seed nodes.
[The following slides are garbled; most of the original Korean text was lost in extraction. The recoverable content: a network-flow formulation of register sharing with unit-capacity edges (Cij = 1) and edge costs Bij set to a weight L or W, minimizing a cost Z by repeatedly augmenting minimum-cost paths from a source S to a sink T; each path binds its variables to one register, e.g. PATH 2: S-b-T gives REG2 = {b}, PATH 3: S-c-g-T gives REG3 = {c, g}, PATH 4: S-d-T gives REG4 = {d}. An analogous network is built over the CDFG for resource sharing. The algorithm is polynomial-time and yields an optimal solution, with a reported saving of about 15%. A top-down methodology is advocated for DSP, microcontroller and ASIC targets, including bit-swapping on programmable DSP busses, where the exclusive-OR of successive bus values measures the transitions seen at the ALU.]
Bus switching: signal transitions on busses dominate power dissipation in bus-based systems; off-chip power consumption can account for up to 70% of chip power. In a CPU, bus signal transitions at the ALU datapath are a major contributor.
[Figure: ALU datapath with multiplexed inputs (mux-in1, mux-in2), register inputs (reg-in1, reg-in2, reg-in3) and data busses (databus-in1, databus-in2)]
C1 = Y1(A2) EXOR Y2(B2)
Bit-wise swapping of a bit in Y2(A2) and a bit in Y2(B2) when a bit in C1 is "1".
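The exclusive-OR of consecutive bus words is the quantity the swapping scheme works on: each 1-bit in the XOR is a bus line that toggles. A minimal Python sketch of this transition count (the 8-bit bus width is an assumption):

```python
def bus_transitions(prev, curr, width=8):
    """Number of bus lines that toggle between two consecutive words:
    the popcount of prev XOR curr, masked to the bus width."""
    mask = (1 << width) - 1
    return bin((prev ^ curr) & mask).count("1")
```

A bit-swapping scheme aims to lower this count over a stream of words, and with it the switched bus capacitance.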
The design is verified by test-vector simulation.
[1] D. Gajski and N. Dutt, High-level Synthesis : Introduction to
Chip and System Design. Kluwer Academic Publishers, 1992.
[2] G. D. Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw-Hill, Inc., 1994.
[3] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-Power
CMOS digital design", IEEE J. of Solid-State Circuits, pp. 473-484,
1992.
[4] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R.
W. Brodersen, "Optimizing power using transformation," IEEE Tr. on
CAD/ICAS, pp. 12-31, Jan. 1995.
[5] E. Musoll and J. Cortadella, "Scheduling and resource binding for low power", Int'l Symp. on System Synthesis, pp. 104-109, Apr. 1995.
[6] Y. Fang and A. Albicki, "Joint scheduling and allocation for
low power," in Proc. of Int'l Symp. on Circuits & Systems, pp.
556-559, May. 1996.
[7] J. Monteiro and Pranav Ashar, "Scheduling techniques to enable
power management", 33rd Design Automation Conference, 1996.
[8] R. S. Martin, J. P. Knight, "Optimizing Power in ASIC
Behavioral Synthesis", IEEE Design & Test of Computers, pp.
58-70, 1995.
[9] R. Mehra, J. Rabaey, "Exploiting Regularity for Low Power Design", IEEE Custom Integrated Circuits Conference, pp. 177-182, 1996.
[10] A. Chandrakasan, T. Sheng, and R. W. Brodersen, "Low Power
CMOS Digital Design", Journal of Solid State Circuits, pp. 473-484,
1992.
[11] R. Mehra and J. Rabaey, "Behavioral level power estimation and
exploration," in Proc. of Int'l Symp. on Low Power Design, pp.
197-202, Apr. 1994.
[12] A. Raghunathan and N. K. Jha, "An iterative improvement
algorithm for low power data path synthesis," in Proc. of Int'l
Conf. on Computer-Aided Design, pp. 597-602, Nov. 1995.
[13] R. Mehra, J. Rabaey, "Low power architectural synthesis and
the impact of exploiting locality," Journal of VLSI Signal
Processing, 1996.
[14] M. B. Srivastava, A. P. Chandrakasan, and R. W. Brodersen,
"Predictive system shutdown and other architectural techniques for
energy efficient programmable computation," IEEE Tr. on VLSI
Systems, pp. 42-55, Mar. 1996.
[15] A. Abnous and J. M. Rabaey, "Ultra low power domain specific
multimedia processors," in Proc. of IEEE VLSI Signal Processing
Workshop, Oct. 1996.
[16] M. C. Mcfarland, A. C. Parker, R. Camposano, "The high level
synthesis of digital systems," Proceedings of the IEEE. Vol 78. No
2 , February, 1990.
[17] A. Chandrakasan, S. Sheng, R. Brodersen, "Low power CMOS
digital design,", IEEE Solid State Circuit, April, 1992.
[18] A. Chandrakasan, R. Brodersen, "Low power digital CMOS design,
Kluwer Academic Publishers, 1995.
[19] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, M. Papaefthymiou, "Precomputation based sequential logic optimization for low power," IEEE International Conference on Computer Aided Design, 1994.
[20] J. Monteiro, S. Devadas and A. Ghosh, "Retiming sequential circuits for low power," in Proceedings of the IEEE International Conference on Computer Aided Design, November 1993.
[21] F. J. Kurdahi, A. C. Parker, "REAL: A Program for Register Allocation," in Proc. of the 24th Design Automation Conference, ACM/IEEE, June, pp. 210-215, 1987.
[22] A. Wolfe. A case study in low-power system level design. In
Proc.of the IEEE International Conference on Computer Design, Oct.,
1995.
[23] T. D. Burd and R. W. Brodersen. Energy efficient CMOS microprocessor design. In Proc. 28th Annual Hawaii International Conf. on System Sciences, January 1995.
[24] A. Dasgupta and R. Karri. Simultaneous scheduling and binding
for power minimization during microarchitectural synthesis. In Int.
Symposium on Low Power Design, pages 69-74, April 1995.
[25] R. S. Martin. Optimizing power consumption, area and delay in behavioral synthesis. PhD thesis, Department of Electronics, Faculty of Engineering, Carleton University, March 1995.
[26] A. Matsuzawa. Low-power portable design. In Proc.
International Symposium on Advanced Research in Asynchronous
Circuits and Systems, March 1996. Invited lecture.
[27] J.D. Meindl. Low-power microelectronics: retrospect and
prospect. Proceedings of the IEEE 83(4):619-635, April 1995.
[Figure: example DFG fragments with add and abs operations on inputs a and b, and +, >, - nodes]