Glitch Reduction and CAD Algorithm Noise in FPGAs
by
Warren Shum
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
Copyright © 2011 by Warren Shum
Abstract
Glitch Reduction and CAD Algorithm Noise in FPGAs
Warren Shum
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2011
This thesis presents two contributions to the FPGA CAD domain. First, a study of
glitch power in a commercial FPGA is presented, showing that glitch power in FPGAs is
significant. A CAD algorithm is presented that reduces glitch power at the post-routing
stage by taking advantage of don’t-cares in the logic functions of the circuit. This method
comes at no cost to area or performance.
The second contribution of this thesis is a study of FPGA CAD algorithm noise –
random choices which can have an unpredictable effect on the circuit as a whole. An
analysis of noise in the logic synthesis, technology mapping, and placement stages is
presented. A series of early performance and power metrics is proposed, in an effort to
find the best circuit implementation in the noise space.
Acknowledgements
First and foremost, I would like to thank Professor Jason Anderson for supervising my
thesis research, and for guiding me along with good ideas and encouragement. I would
also like to thank Professors Jonathan Rose, Vaughn Betz, and Olivier Trescases, for
reviewing this work and serving on my defence committee.
I am also grateful to my parents, for supporting me in all my academic endeavors.
Thanks to my fellow research group members: Marcel, Bill, Jason L., James, Andrew,
Mark, Steven, Ahmed, Victor, Stefan, Alex, Kevin, and my office mates in PT477. I
appreciate the feedback on my work, the sporting activities, as well as just sharing
conversation.
Thanks to the staff at SciNet for their technical support.
I thank NSERC and OGS for financial support throughout my degree.
Chapter 1
Introduction
Field-programmable gate arrays (FPGAs) are user-configurable logic devices capable of
implementing digital circuits. These devices are used in a wide variety of areas including
communications, automotive, industrial and consumer markets. The appeal of FPGAs
versus application-specific integrated circuits (ASICs) is that they allow the user to avoid
the high cost of chip fabrication and to reduce time-to-market. FPGAs allow
a hardware designer to prototype their design quickly, while an ASIC design would take
more time and money to repair, should an error be found. A mask set at 45nm can
cost as much as $2M [Fran 10], a cost high enough to drive away all but the highest-volume applications.
To create an FPGA implementation of a design, a hardware engineer will typically use
a hardware description language (HDL), such as Verilog or VHDL. A series of computer-
aided design (CAD) tools transform the HDL into a digital circuit that can be pro-
grammed onto the FPGA. A typical sequence of steps in the CAD flow is as follows:
• Logic Synthesis: The logic functions needed to implement the circuit are derived
and optimized.
• Technology Mapping: The logic functions are mapped into the logic elements
specific to the target device architecture.
• Packing: The logic elements are grouped into larger units corresponding to the
target device architecture.
• Placement: The mapped logic elements are placed into physical locations on the
target device.
• Routing: The proper connections are made between the logic elements using the
programmable routing network.
The quality of the resulting circuit depends on the quality of the tools used to generate
it. Quality can be measured in terms of area, performance and power. It is here that
FPGAs fall short of ASICs – the area, performance and dynamic power gaps between
them have been estimated at 40x, 4x and 12x, respectively [Kuon 07]. By studying
existing CAD algorithms and exploring new ones, FPGAs can close the gap with ASICs
and attract a larger portion of the digital logic market.
1.2 Glitch Power
As mentioned previously, one area for improvement in FPGAs is power consumption.
Power can be reduced through efforts at various stages: the architectural level, the
circuit level, or the CAD level (which will be the focus here). In particular, glitch
power (the power dissipated by unnecessary signal transitions) is an attractive target for
reduction since it comprises from 4% to 73% of total dynamic power, with an average of
22.6% [Lamo 08]. We present two contributions in this area, the results of which have
been published [Shum 11]:
1. An analysis of glitch power in commercial FPGAs.
2. A CAD approach for reducing glitch power at no cost to area or performance.
Chapter 2 provides background on FPGA glitch power. It begins with a description
of how glitches occur in FPGAs, and some previous works on how to reduce glitch
power. To motivate our research, we present our own analysis on glitch power
in commercial FPGAs. Our results show an average of 26% of dynamic power
from glitches. This chapter also describes don’t-cares in logic functions, which will
be used in the glitch reduction algorithm. We show that the average occurrence
of don’t-cares under simulation is sufficient to supply ample opportunities for our
algorithm.
Chapter 3 presents an algorithm for glitch power reduction which can be performed
post-routing, incurring zero area and performance cost. The algorithm takes ad-
vantage of don’t-care bits in the truth tables of functions in a circuit, setting them
to values which minimize the amount of glitch power dissipated. The algorithm is
tested with a commercial FPGA CAD tool suite and architecture, and shows an
average glitch power reduction of 13.7%, and an average dynamic power reduction
of 4.0%.
1.3 CAD Algorithm Noise
Given the tremendous challenge of solving modern-day CAD problems, the algorithms
used for these problems generally use heuristics to seek a reasonable solution in an ac-
ceptable amount of time. In the course of exploring the vast solution space of these
problems, there is often a need to choose between two or more alternatives that appear
to have the same quality. Such choices, although seemingly innocuous at the time of
selection, can have ripple effects on future choices, causing the final quality of the circuit
to vary if different choices are made. We label these variations as noise. We present the
following contributions in this area:
1. An analysis of a series of logic synthesis and technology mapping algorithms, ex-
posing potential sources of noise that have not been studied before.
2. Experimental results on the amount of noise present in several CAD algorithms, in
terms of critical path delay and dynamic power. The concept of power noise is also
a new contribution which has not been previously studied.
3. A method for predicting the best circuits in terms of performance and power in the
presence of noise.
Chapter 4 introduces the concept of CAD algorithm noise. We expose hidden sources
of noise in the logic synthesis and technology mapping algorithms of the academic
CAD tool ABC [Berk 06]. We present the results of our noise analysis, showing
the effects of random choices in thousands of circuit compilations. The results of
the noise injection show a standard deviation of as much as 3.3% in critical path
delay, and 3.7% in dynamic power.
Chapter 5 presents a solution to the variance in circuit quality produced by CAD algo-
rithm noise. The idea is to perform several synthesis and mapping runs of a circuit
(using different seeds) and use early timing and power metrics to predict the best
one(s) to advance to the placement and routing stages. This would save the time
that would be spent on a large number of place-and-route runs. In this chapter, a
wide array of early timing prediction models is evaluated, including several approaches
to estimating logic and routing delays. For power prediction, two fast
simulation models are used, as well as information from the packing stage of the
CAD flow. The application of these prediction models in a commercial FPGA leads
to an average benefit of up to 1.8% in delay and 1.8% in power compared to the
average noise-injected circuit.
Chapter 6 concludes the work. We summarize the contributions of the previous chapters and present possible extensions and related research topics for future work.
Chapter 2
Glitch Power and Don’t-Cares in
FPGAs
2.1 Introduction
Power in FPGAs can be divided into two categories: static power and dynamic power.
Static power is due to current leakage in transistors. Dynamic power is a result of signal
transitions between logic-0 and logic-1. These transitions can be split into two types:
functional transitions and glitches. Functional transitions are those which are necessary
for the correct operation of the circuit. Glitches, on the other hand, are transitions that
arise from unbalanced delays to the inputs of a logic gate, causing the gate’s output to
transition briefly to an intermediate state. Although glitches do not adversely affect the
functionality of a synchronous circuit (as they settle before the next clock edge), they
have a significant effect on power consumption. Using an academic FPGA model, glitch
power has been estimated to comprise from 4% to 73% of total dynamic power, with an
average of 22.6% [Lamo 08]. This is a significant motivator for the reduction of glitch
power.
As a means of reducing glitch power, we seek to take advantage of don’t-cares in
a circuit. Don’t-cares are an important concept in logic synthesis and are frequently
used for the optimization of logic circuits. A don’t-care of a logic function within a
larger circuit is an input state for which the function’s output can be either logic-0 or
logic-1, without affecting the circuit’s correctness. Don’t-cares can come from external
constraints or from within the circuit itself. An external constraint may be specified by
the designer (e.g. asserting that a certain input combination will never be applied). A
logic function within a circuit may also have don’t-cares due to its surrounding logic,
for example, if the logic feeding the function’s fanins can never satisfy a certain input
combination, or if the function’s output does not affect the circuit’s primary outputs
under certain circumstances.
This chapter is organized as follows. Section 2.2 gives a brief overview of basic FPGA
architecture. Section 2.3 describes how glitches occur in FPGAs. Section 2.4 summarizes
some previous works on FPGA glitch reduction. Section 2.5 describes don’t-cares and
how they can be found. Section 2.6 gives our analysis of glitch power, while Section 2.7
gives our analysis of don’t-cares. Section 2.8 summarizes the chapter.
2.2 FPGA Architecture
Before presenting our glitch analysis and glitch reduction method, it is important to recap
some basic FPGA architecture and terminology. Fig. 2.1(a) shows a section of a typical
island-style FPGA architecture. It is composed of logic blocks connected to one another
through a programmable routing network. Programmable routing switches (shown as x’s
in Fig. 2.1(a)) allow pins on logic blocks to be programmably connected to pre-fabricated
metal wire segments, and also allow wire segments to be programmably connected with
one another to form routing paths.
Inside the logic blocks, logic functions are implemented using look-up-tables (LUTs).
An example is shown in Fig. 2.1(b). A k-input LUT can implement any logic function of
up to k variables. In essence, a LUT is a hardware implementation of a truth table, where
Figure 2.1: (a) Logic blocks and routing in an island-style FPGA architecture. (b) Example of a 3-input LUT (look-up-table) with truth table in Table 2.1.
the output value for each minterm is held in an SRAM configuration cell (bit). A k-input
LUT requires 2^k configuration bits. For this work, we target an FPGA that contains 6-input LUTs, which are typical of modern commercial FPGA architectures [Altec, Xili].
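As a software analogy (not from the thesis), a k-input LUT can be modeled as a direct truth-table lookup. The sketch below is a minimal Python rendering; the function name and the convention that the first input forms the most significant bit of the minterm index are our assumptions.

```python
def lut_eval(config_bits, inputs):
    """Evaluate a k-input LUT modeled as a truth-table lookup.

    config_bits: list of 2**k output bits, indexed by the minterm number
                 formed from the inputs (first input = most significant bit).
    inputs: list of k input bits (0 or 1).
    """
    assert len(config_bits) == 2 ** len(inputs)
    index = 0
    for bit in inputs:  # build the minterm index, MSB first
        index = (index << 1) | bit
    return config_bits[index]

# A 3-input LUT implementing the truth table of Table 2.1 (f = a AND NOT b)
config = [0, 0, 0, 0, 1, 1, 0, 0]  # minterms 000..111
print(lut_eval(config, [1, 0, 1]))  # abc = 101 -> 1
```

The 2^k configuration bits are exactly the `config` list: reprogramming the LUT means rewriting entries of this list without touching anything else.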
2.3 Glitch Power in FPGAs
The dynamic power consumed by an FPGA can be modeled by the formula

P_{dyn} = \frac{1}{2} \sum_{i=1}^{n} S_i C_i f V_{dd}^2    (2.1)
where n is the number of nets in the circuit, Si is the switching activity of net i, Ci is
the capacitance of net i, f is the frequency of the circuit, and Vdd is the supply voltage.
The glitch reduction algorithm presented in this work aims to lower the switching activity
as a means of reducing dynamic power.
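For illustration, Eq. (2.1) can be evaluated directly. The sketch below is a minimal Python rendering; the function name and the per-net activity and capacitance values are made up for the example, not taken from the thesis.

```python
def dynamic_power(S, C, f, vdd):
    """Dynamic power per Eq. (2.1): 0.5 * sum(S_i * C_i) * f * Vdd^2.

    S: switching activity of each net (transitions per cycle)
    C: capacitance of each net (farads)
    f: clock frequency (Hz), vdd: supply voltage (volts)
    """
    return 0.5 * sum(s * c for s, c in zip(S, C)) * f * vdd ** 2

# Hypothetical 3-net circuit
S = [0.2, 0.5, 0.1]                 # glitch reduction lowers these values
C = [10e-15, 25e-15, 5e-15]
print(dynamic_power(S, C, f=200e6, vdd=1.1))  # about 1.8e-6 W
```

Note that only the S_i terms are under the control of a post-routing algorithm: C, f and Vdd are fixed once placement and routing are done, which is why switching activity is the target.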
As a result of the differences in delays through the routing network and LUTs them-
selves, signals arriving at LUT inputs may transition at different times, leading to glitches.
Figure 2.2: Example waveform showing a glitch on the output of a LUT f with truth table given in Table 2.1.
abc | f | Care
000 | 0 | Y
001 | 0 | Y
010 | 0 | Y
011 | 0 | Y
100 | 1 | N
101 | 1 | Y
110 | 0 | N
111 | 0 | Y

Table 2.1: Glitch example truth table for a logic function with inputs abc and output f. A possible example of cares is given (care = Y, don't-care = N).
An example is shown in Fig. 2.2. This LUT implements the 3-input function given in
Table 2.1. Consider the case where the inputs transition from 000 → 111. Ideally, the
output f would remain constant at 0. However, varying arrival times on the inputs may
cause an input transition sequence such as 000 → 100 → 110 → 111, causing f to make
a 0 → 1 → 0 → 0 transition rather than remaining at 0. This leads to extra power
consumed by the LUT and any of its fanouts that propagate the glitch. Furthermore,
the glitch is propagated through the FPGA interconnect which presents a high capacitive
load due to its long metal wire segments and programmable (buffered) routing switches.
Prior work has shown, in fact, that interconnect accounts for 60% of total FPGA dynamic
power [Shan 02].
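The glitch in the example above can be reproduced by stepping the truth table of Table 2.1 through the skewed input sequence. The dictionary encoding below is our own sketch:

```python
# Truth table of Table 2.1, keyed by the input pattern abc
f = {"000": 0, "001": 0, "010": 0, "011": 0,
     "100": 1, "101": 1, "110": 0, "111": 0}

# Skewed arrival order for a 000 -> 111 transition: a switches first,
# then b, then c, exposing the intermediate states 100 and 110
sequence = ["000", "100", "110", "111"]
outputs = [f[v] for v in sequence]
print(outputs)  # [0, 1, 0, 0] -- a 0->1->0 glitch instead of a constant 0
```

If the entry for 100 were a don't-care set to 0 instead of 1, the same sequence would produce [0, 0, 0, 0] and the glitch would disappear, which is the core idea of the next chapter.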
2.4 Previous Work on Glitch Reduction in FPGAs
Glitch reduction techniques can be applied at various stages in the CAD flow. Since
glitches are caused by unbalanced path delays to LUT inputs, it is natural to design
algorithms that attempt to balance the delays. This can be done at the technology
mapping stage [Chen 07b], in which the mapping is chosen based on glitch-aware switch-
ing activities. Another approach operates at the routing stage [Dinh 09], in which the
faster-arriving inputs to a LUT are delayed by extending their path through the rout-
ing network. Delay balancing can also be done at the architectural level. The work
in [Lamo 08] inserts programmable delay elements to balance the arrival times of signals
at LUT inputs. However, these approaches all incur an area or performance cost.
Some works use flip-flop insertion or pipelining to break up deep combinational logic
paths which are the root of high glitch power. Circuits with higher degrees of pipelining
tend to have lower glitch power because they have fewer logic levels, thus reducing the
opportunity for delay imbalance [Wilt 04]. Flip-flops with shifted-phase clocks can be
inserted to block the propagation of glitches [Lim 05]. Another work in [Czaj 07] uses
negative edge-triggered flip-flops in a similar fashion, but without the extra cost of gen-
erating additional clock signals. It is also possible to apply retiming to the circuit by
moving flip-flops to block glitches [Fisc 05].
Our work draws inspiration from hazard-free logic synthesis techniques for asynchronous circuits, such as [Lin 95]. In asynchronous circuits, glitches (hazards) cannot
be tolerated because they may produce incorrect behavior (consider, for example, the
disastrous effect of a glitch on a handshaking signal). Our work is different in that while
hazards are tolerable from a functionality standpoint, it is beneficial to remove them to
reduce power consumption.
A key feature of the work presented here is that it has no impact on the rest of
the design flow. It is applied after placement and routing, and as a consequence, the
algorithm has no cost in terms of performance or area. Other methods incur additional
area/delay from the inclusion of delay elements, registers and extra routing resources, as
well as disrupting the synthesis and layout of the circuit in an unpredictable way. Our
approach maintains the results of the existing compilation while only making changes to
the don't-cares within LUT truth table configuration bits. This zero-overhead property is
highly desirable and is not shared by previous glitch reduction approaches.
2.5 Don’t-Cares in Logic Circuits
To prevent glitches, we take advantage of don’t-cares. These are entries in the truth
table where a LUT’s output can be set as either logic-0 or logic-1 without affecting the
correctness of the circuit. Don’t-cares fall into two categories: satisfiability don’t-cares
(SDCs) and observability don’t-cares (ODCs) [Mish 09]. SDCs occur when a particular
input pattern can never occur on the inputs to a LUT. In the example shown in Fig. 2.3(a),
the inputs a = 0, b = 1 will never occur. ODCs occur when the output of a LUT cannot
propagate to the circuit’s primary outputs. In the example, the output of f2 has no
effect when c = 0.
In this work, we leverage the don't-care analysis capabilities of the ABC logic synthesis
system developed at UC Berkeley [Berk 06]. ABC incorporates Boolean satisfiability
(SAT)-based complete don’t-care analysis that can be used to determine the don’t-care
minterms for a given LUT in a technology mapped FPGA circuit [Mish 05]. To find the
don’t-cares for a given LUT, f , ABC uses a miter circuit, as illustrated in Fig. 2.3(b).
As shown, two instances of LUT f and (some of) its surrounding circuitry are created –
the surrounding circuitry is shown as a shaded region in the figure. In one instance, f ’s
output is in true form; in the other instance, f ’s output is inverted. The outputs of the
two instances are exclusive-OR’ed with one another, with the XOR gate outputs being fed
into a wide OR gate. The final OR gate produces an output logic signal C(x) for a given
input vector x.
Figure 2.3: (a) Example of SDCs (left) and ODCs (right). (b) Miter circuit used in don't-care analysis [Mish 05].
For an input vector x to the miter in Fig. 2.3(b), one can compute a local input
vector y to LUT f . For any such x where C(x) is logic-1, y is a care minterm of LUT f ;
that is, LUT f affects the circuit outputs for input vector x. The basic approach taken
in [Mish 05] is to use a fast vector-based simulation as well as SAT to find all vectors,
x, where C(x) evaluates to logic-1, yielding the complete care set for LUT f . This
provides a general picture of the don’t-care analysis approach and the reader is referred
to [Mish 05] for full details. Don’t-cares have recently been used for area reduction in
FPGA circuits [Mish 09].
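ABC's complete don't-care analysis is SAT-based; purely for intuition, the miter comparison can be mimicked by exhaustive simulation on a toy circuit. The two-function circuit below, and all names in it, are our own hypothetical example (not the circuit of Fig. 2.3): the primary output masks f2 whenever a = 0, so f2's minterms with a = 0 are observability don't-cares.

```python
from itertools import product

# Hypothetical toy circuit: primary inputs a, b; internal function f2;
# primary output g = f2(a, b) AND a, so f2 is unobservable when a = 0.
def f2(a, b):
    return a | b

def g(a, b, f2_func):
    return f2_func(a, b) & a

def odc_minterms():
    """Brute-force ODC check (the miter idea by exhaustive simulation):
    a minterm of f2 is a don't-care if flipping f2's output there never
    changes the primary output g for any primary input vector."""
    dcs = []
    for a0, b0 in product([0, 1], repeat=2):
        def flipped(a, b):
            v = f2(a, b)
            return v ^ 1 if (a, b) == (a0, b0) else v
        if all(g(a, b, f2) == g(a, b, flipped)
               for a, b in product([0, 1], repeat=2)):
            dcs.append((a0, b0))
    return dcs

print(odc_minterms())  # [(0, 0), (0, 1)] -- f2 is unobservable when a = 0
```

The miter of Fig. 2.3(b) performs the same true-form versus inverted-form comparison, but uses SAT to cover the input space without enumerating it, which is what makes the analysis scale to real circuits.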
2.6 Glitch Power Analysis
To motivate the need for glitch reduction, we examine the amount of glitch power dissi-
pated by 20 MCNC benchmark designs. These designs were fully compiled using Altera
Quartus 10.1, targeting 65nm Stratix III devices [Alteb]. ModelSim 6.3e was then used
to perform a functional (zero-delay) and timing simulation of each circuit using 5000 ran-
dom input vectors, producing two switching activity (VCD) files. The VCD files contain
a record of every transition of every net in the circuit. The dynamic power was then
computed using Quartus PowerPlay – Altera’s power analysis tool. The glitch filtering
setting was enabled, as it only filters glitches that are too short to occur in an actual
device.

2.7 Don't-Care Analysis

Table 2.3: Percentage of simulated local LUT input states corresponding to don't-cares.
The percentages vary from 0.8% to 37.2%, with an average of 15.1%. This tells us that
not only do circuits contain an abundance of don't-cares, but also that, surprisingly, these
don't-cares are often traversed in circuit operation. In other words, a LUT's don't-care
minterms are frequently “visited” under vector stimulus. The visits to such don’t-care
minterms may potentially lead to additional unnecessary toggles on LUT outputs. We
can thus potentially reduce glitches through don’t-care settings, which is the core idea of
our approach (which will be described in the next chapter).
2.8 Conclusion
In this chapter, we introduced basic FPGA architecture and gave an introduction to
power consumption in FPGAs. We summarized some previous works in the area of glitch
reduction. We described how glitches are generated, and presented our own analysis of
glitch power consumption in commercial FPGAs. Glitch power was found to comprise
an average of 26.0% of total dynamic power. We also explained logical don’t-cares and
how they can be found, as well as analyzing how often they occur in circuits. It was
found that an average of 15.1% of visited LUT input states are don’t-cares. Together,
these results indicate that glitch power is a good target for power reduction, and that
don’t-cares are prevalent enough to enable a don’t-care based glitch reduction algorithm.
This algorithm will be presented in the next chapter.
Chapter 3
Glitch Reduction Using Don’t-Cares
3.1 Introduction
In this chapter, we present a glitch reduction optimization algorithm based on don't-cares. It sets the output values for the don't-cares of logic functions in a way that
reduces the amount of glitching. This process is performed after placement and routing,
using timing simulation data to guide the algorithm. Relative to prior published FPGA
glitch reduction techniques, our approach is entirely new, and leverages the ability to
re-program FPGA logic functions without altering the placement and routing. Since the
placement and routing are maintained, this optimization has zero cost in terms of area
and delay, and can be executed after timing closure is completed.
Section 3.2 describes the new algorithm for glitch reduction. Section 3.3 describes the
methodology for testing the algorithm. Section 3.4 shows the power reduction results,
and Section 3.5 summarizes the chapter.
3.2 Glitch Reduction Algorithm
We begin with an example to illustrate how don’t-cares can be used to prevent glitches.
The general idea is to simulate the circuit, then traverse the simulation vectors for each
Figure 3.1: Example: before glitch reduction. (a) LUT with don't-care SRAM bit shaded. (b) Simulation waveform.
LUT, focusing on vectors corresponding to don’t-cares. We keep a count of the number of
instances for each don’t-care when we would prefer setting it to logic-0 or logic-1 (based
on the care outputs surrounding it). We will refer to these counts as “votes”. When the
end of the simulation vectors is reached, we set the don’t-cares to the value (logic-0 or
logic-1) corresponding to the more popular vote.
Figs. 3.1(a) and 3.1(b) show an example of a LUT and its simulation waveform. Let
us assume that the truth table row for abc = 100 corresponds to a don’t-care, found using
the method described in Section 2.5. We illustrate the don’t-care by shading its SRAM
configuration bit in Fig. 3.1(a). We also assume that the don’t-care bit is currently set
to logic-1 – an arbitrary choice. We initialize the vote counts to 0 (vote0 = 0, vote1 = 0).
Now, we traverse the waveform of Fig. 3.1(b) from left to right, stopping when we
encounter an input corresponding to a don’t-care (DC). In this case, we encounter the
don’t-care input abc = 100 in the second time step. We then consider the previous LUT
output and the next LUT output. In this case, we see that they are both logic-0. If
we were to change the output for abc = 100 to logic-0 instead of logic-1, we would be
able to prevent two glitch transitions on f . Therefore, we increment the vote counter
for logic-0 (vote0 = 1, vote1 = 0). In the fourth time step, we encounter another don’t-
care flanked by two logic-0 outputs. We increment the vote counter for logic-0 again
(vote0 = 2, vote1 = 0).
At the sixth time step, we see that the neighboring outputs of this don't-care instance
are logic-0 and logic-1. In this case, there would be one transition on f whether the
don’t-care is set to logic-0 or logic-1. Therefore, no change is made to the vote counts
(vote0 = 2, vote1 = 0). At this point, we have exhausted the simulation waveform. We
set the don’t-care bit to logic-0, since vote0 is greater than vote1. The resulting LUT and
waveform are shown in Figs. 3.2(a) and 3.2(b). We can see that four glitch transitions
have been eliminated on output f .
A more formal expression of the glitch reduction algorithm is shown in Algorithm 1.
Figure 3.2: Example: after glitch reduction. (a) LUT with altered don't-care SRAM bit shaded. (b) Simulation waveform with glitches removed.
It takes a placed and routed netlist as its input. We represent the netlist as a graph
G(V,E), where V is the set of vertices (LUTs) and E is the set of edges (routing wires).
The algorithm also takes a value change dump (VCD) file containing the results of a
timing simulation of the circuit. The simulation vectors are denoted as S, where the ith
local input vector to LUT n is denoted as Sn[i]. A timing simulation is needed rather than
a functional one because glitches arise from delay mismatches, which will only appear
under timing simulation.
The algorithm iterates through each LUT in the netlist, progressing from shallower
levels to deeper ones. This order is used because glitches prevented on shallower LUTs
will be prevented from propagating to deeper LUTs, thus saving more power. Within
each level, the LUTs are examined in descending order of power consumption. This
prioritizes the LUTs with the greatest potential savings. For each LUT, the following
steps are performed:
1. Compute the don’t-cares of the LUT.
2. Scan the input vectors.
3. Set the values of the don’t-cares.
3.2.1 Computing the Don’t-Cares for a LUT
As described previously in Section 2.5, we use ABC's SAT-based don't-care analysis
to compute the input states (minterms) of the particular LUT which are don't-cares
(Algorithm 1, line 3). DC is the set of don't-care input states.
3.2.2 Scanning the Input Vectors
The sequence of local input vectors to the LUT (denoted Sn) is extracted from the timing
simulation VCD file. These input vectors are examined in order (line 5). When an input
Algorithm 1 Glitch reduction algorithm.
Input: a netlist G(V, E) with simulation vectors S
Output: a netlist with modified LUT functions
 1: for each LUT n ∈ V in order of priority do
 2:   {1. Compute the don't-cares of the LUT}
 3:   DC = compute_dont_cares(n)
 4:   {2. Scan the input vectors}
 5:   for i = 0 to size(Sn) do
 6:     if Sn[i] ∈ DC then
 7:       prev ← previous care output
 8:       next ← next care output
 9:       if prev = 0 and next = 0 then
10:         Votes0(Sn[i]) ← Votes0(Sn[i]) + 1
11:       else if prev = 1 and next = 1 then
12:         Votes1(Sn[i]) ← Votes1(Sn[i]) + 1
13:       end if
14:     end if
15:   end for
16:   {3. Set the values of the don't-cares and update netlist}
17:   for each don't-care d ∈ DC do
18:     if Votes0(d) > Votes1(d) then
19:       assign 0 as the output of d
20:     else if Votes1(d) > Votes0(d) then
21:       assign 1 as the output of d
22:     end if
23:   end for
24: end for
vector Sn[i] corresponding to a don’t-care is reached (line 6), we look at the closest
states in the past and future that correspond to care input vectors (lines 7-8). We use
this information to decide whether this don’t-care should be set to a logic-0, logic-1, or
whether there is no preference. If the closest past and future cares are identical (both
logic-0 or both logic-1) then the don’t-care should be set to the same value. Otherwise,
there is no preference. For each don’t-care minterm, a count of “votes” is kept, indicating
how many times in the simulation it would be beneficial to set it to a logic-0 or logic-1
(lines 9-12). This process is repeated for each input vector Sn[i] in the full simulation
time (lines 5-15).
Consider again the example shown in Fig. 2.2 and Table 2.1. Suppose that for input
Sn[i] = 100, the LUT output is a don’t-care. This means that even though it is assigned
to logic-1 in the truth table, we can assign it to logic-0 or logic-1 without affecting the
Figure 3.3: A cluster of don’t-cares.
functionality of the circuit. In this case, we see a glitch on f making a 0 → 1 → 0 → 0
transition as the inputs transition 000→ 100→ 110→ 111. Looking at the closest care
states before and after input 100, we see that they both output a logic-0. Therefore, the
algorithm votes for the output of 100 to be logic-0.
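The vote-counting scan can be sketched in plain Python. This is a simplified re-implementation of the scan step (Algorithm 1, lines 5-15), not the thesis code; the vector sequence below is illustrative, reusing the truth table of Table 2.1 with state 100 treated as a don't-care.

```python
def scan_votes(local_vectors, dc_set, care_output):
    """Count votes for each visited don't-care state.

    local_vectors: sequence of local input states seen by one LUT over time.
    dc_set: set of input states that are don't-cares.
    care_output: truth-table dict giving the output for care states.
    Returns (votes0, votes1) dicts keyed by don't-care state.
    """
    votes0, votes1 = {}, {}
    for i, state in enumerate(local_vectors):
        if state not in dc_set:
            continue
        prev = nxt = None
        for j in range(i - 1, -1, -1):              # nearest care output in the past
            if local_vectors[j] not in dc_set:
                prev = care_output[local_vectors[j]]
                break
        for j in range(i + 1, len(local_vectors)):  # nearest care output in the future
            if local_vectors[j] not in dc_set:
                nxt = care_output[local_vectors[j]]
                break
        if prev == 0 and nxt == 0:
            votes0[state] = votes0.get(state, 0) + 1
        elif prev == 1 and nxt == 1:
            votes1[state] = votes1.get(state, 0) + 1
    return votes0, votes1

# Truth table of Table 2.1; suppose state "100" is a don't-care
f = {"000": 0, "001": 0, "010": 0, "011": 0,
     "100": 1, "101": 1, "110": 0, "111": 0}
v0, v1 = scan_votes(["000", "100", "110", "100", "000", "100", "101"], {"100"}, f)
print(v0, v1)  # {'100': 2} {} -- two votes for setting "100" to logic-0
```

After the scan, the don't-care would be set to logic-0 because vote0 exceeds vote1, mirroring the waveform walk-through above.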
It is possible that the simulation data may include a long contiguous cluster of don’t-
cares. In these cases, the more desirable state could be the opposite of the one that
would be chosen by this algorithm. For example, it may be beneficial to set a particular
don’t-care to logic-0 within a cluster of logic-0’s (don’t-cares) in between two logic-1’s
(cares) rather than attempting to set the entire cluster to logic-1. This situation is
illustrated in Figure 3.3. The fourth time step shows a don't-care surrounded by other
don’t-cares which have high vote0 (i.e. they will be set to 0). Therefore, we can see that
setting this DC to 0 would be preferable. However, the algorithm would set it to 1, as
the nearest cares are both 1. This would cause a glitch. Fortunately, experimental data
shows that such long clusters are uncommon. The average length of don’t-care clusters
in the benchmark set is 3.5. This justifies our use of the closest care input vectors.
3.2.3 Setting the Don’t-Cares
When the end of the input vectors is reached, each don’t-care is set to the value with more
votes (unless the votes are tied, in which case nothing is done – the choice is arbitrary).
The loop at lines 17-23 walks through each don’t-care d ∈ DC (the set of don’t-care
minterms) and checks whether logic-0 or 1 has a majority of votes. The netlist is updated
accordingly before proceeding to the next LUT. This is critical because changing the logic
function of one LUT can affect the don’t-cares of other LUTs, due to incompatibility
between don’t-cares [Mish 09]. By ensuring that the don’t-cares are computed using the
most recent information, the circuit is guaranteed to remain functionally-equivalent to
the original.
3.2.4 Iterative Flow
Following the modification of the circuit, the simulation results become outdated, due
to the changes to the LUT functions. Therefore, we repeat the simulation using the
modified circuit after performing glitch reduction on the full circuit. The algorithm is
then repeated. In practice, the majority of the glitch reduction occurs within the first
three iterations.
It is important to note that the loop of the iterative flow does not involve re-running
placement and routing. This is vital for two main reasons. First, the results of the
existing compilation will be preserved, so there is no interference with timing closure.
Second, the delays within the circuit will be kept the same, thus minimizing the amount
of change to the simulation vectors. This allows the algorithm to converge quickly.
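The outer loop of this iterative flow can be sketched as below. The `simulate` and `reduce_glitches` callables are placeholders for the timing simulation and the per-LUT don't-care optimization; note that placement and routing are never re-run inside the loop:

```python
def glitch_reduction_flow(circuit, simulate, reduce_glitches, iterations=3):
    """Alternate timing simulation and don't-care assignment.

    simulate(circuit) -> simulation vectors (outdated after any LUT edit)
    reduce_glitches(circuit, vectors) -> True if any LUT was modified
    Three iterations suffice in practice; the loop also stops early
    once an iteration makes no changes (convergence).
    """
    for _ in range(iterations):
        vectors = simulate(circuit)
        if not reduce_glitches(circuit, vectors):
            break  # converged: no LUT function changed this pass
    return circuit
```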
The algorithm runtime is on the order of minutes for the benchmarks used. Al-
though the iterative process employs a timing simulation, the fact that this algorithm is
performed after place-and-route mitigates the issue of runtime. We envision a usage sce-
nario in which the designer runs this algorithm as part of a final pass after timing closure
has been achieved. Since no modifications are made to the circuit’s timing characteristics,
timing closure is preserved.
Figure 3.4: Experimental flow.
3.3 Methodology
We perform our glitch reduction algorithm on 20 MCNC benchmark circuits. The exper-
imental methodology was chosen to include commercial CAD tools wherever possible, to
evaluate the efficacy of the algorithm on real-world FPGAs. The flow is shown in Fig. 3.4.
We perform a full compilation using Quartus II 10.1 (synthesis, placement and routing)
targeting the Altera Stratix III 65nm FPGA family [Alteb]. This is followed by a timing
simulation using ModelSim SE 6.3e. For each circuit, 5000 random input vectors are
applied. We use a set of custom scripts to transform the simulation netlist generated by
Quartus into BLIF format, which can then be read into ABC, where the glitch reduction
is performed. Combinational equivalence checking (command cec in ABC [Mish 06c]) is
used after the glitch reduction step to ensure that the functionality of the circuit remains
the same. The output from ABC is used to modify the configuration bits in the simula-
tion netlist, thus ensuring that the placement and routing remain identical. Three passes
of the optimization loop are performed. Experiments show that very few changes, if any,
are made after this point (i.e. further iterations have virtually no effect). The power
measurements are performed using Quartus PowerPlay.
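The experimental flow can be orchestrated as a simple command sequence, sketched below. The project and script names are hypothetical, `glitch_reduce` stands in for the thesis's own don't-care optimization pass (it is not a built-in ABC command), and the tool options are abbreviated; only `cec` is the real ABC equivalence-checking command cited above:

```python
import shlex

# Illustrative command sequence for the flow of Fig. 3.4.
FLOW = [
    "quartus_sh --flow compile proj",                 # synthesis, place & route
    "vsim -c -do simulate.do",                        # timing simulation, 5000 vectors
    "python netlist_to_blif.py proj.vo proj.blif",    # custom conversion script
    'abc -c "read proj.blif; glitch_reduce; cec proj.blif; write proj_opt.blif"',
    "python blif_to_netlist.py proj_opt.blif proj.vo" # patch LUT masks only
]

def dry_run(flow):
    """Return the program name of each step without executing anything."""
    return [shlex.split(cmd)[0] for cmd in flow]
```

Because the final step only rewrites LUT configuration bits in the simulation netlist, placement and routing are guaranteed to remain identical.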
Figure 3.5: (a) Dynamic power reduction vs. baseline (default) don't-care settings and worst-case settings. (b) Glitch power reduction vs. baseline (default) don't-care settings and worst-case settings.
3.4 Results
The leftmost bars in Fig. 3.5(a) (vs. baseline) represent the percentage reduction in total
core dynamic power after performing the glitch reduction algorithm. Immediately, we
can see that about half of the circuits benefit from the algorithm. The average reduction
is 4.0%, with a peak of 12.5%. Fig. 3.5(b) shows the corresponding reduction in glitch
power. The average reduction is 13.7%, with a peak of 49.0%. Naturally, the amount of
power reduction possible is based on the amount of glitching present and the number of
don’t-cares available. While the overall average power reductions are relatively modest,
we believe they will interest FPGA vendors and power-sensitive FPGA customers, as they
come at no cost to performance or area. For some circuits, over 10% power reduction
can be achieved essentially for “free”.
It is also interesting to look at the optimized power vs. the worst case don’t-care
settings possible, as illustrated by the rightmost bars in Fig. 3.5 (vs. worst-case). In this
experiment, we set the don’t-cares to the opposite of how they would normally be set
by our optimization algorithm, to examine the potential worst-case glitch power arising
from don’t-cares. Here, we see an average total dynamic power savings of 9.8% and
a peak savings of 30.8% (Fig. 3.5(a)). These results show that don’t-care settings can
potentially have a large impact on power if set to sub-optimal values.
The varied results in Fig. 3.5 can be correlated with the glitch power and don’t-care
data in Tables 2.2 and 2.3. For instance, des had a high glitch power in Table 2.2, yet
we did not observe a significant power reduction for this circuit. However, in Table 2.3,
we see that it had only 0.8% of LUT inputs as don’t-cares, thus reducing the number of
opportunities for optimization. On the other hand, pdc had a high amount of glitching
as well as ample don’t-cares, thus allowing it to be greatly improved by the algorithm –
12.5% dynamic power reduction.
We also examined the bias of votes cast on each don’t-care minterm in each LUT
in each circuit. The average results are shown in Fig. 3.6. The bias is defined as the
Figure 3.6: Average vote bias.
percentage of votes that were cast for the more popular setting, whether logic-0 or logic-
1. Bias is calculated for each don’t-care individually and averaged across the circuit. As
shown in the figure, the bias value tends to be in the 80-100% range, indicating that there
usually exists a highly preferable setting for a particular don’t-care minterm in a LUT.
This is an important observation because it indicates that our don’t-care settings are
providing a benefit most of the time (as opposed to the case of a bias around 50%, which
would imply that selecting either logic-0 or logic-1 for the don’t-care minterm is equally
good). These observations suggest that there usually exists a value for each don’t-care
(either 0 or 1) that is much better than the other, meaning that one can pick don’t-care
logic values with a high degree of confidence.
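The bias computation just described can be sketched directly from its definition; the representation of votes as per-minterm (vote0, vote1) pairs is an assumption about the bookkeeping, not the thesis's actual data structure:

```python
def vote_bias(vote0, vote1):
    """Bias of one don't-care: fraction of votes for the winning value."""
    total = vote0 + vote1
    if total == 0:
        return None  # no votes were cast for this minterm
    return max(vote0, vote1) / total

def average_bias(votes):
    """Average bias across a circuit's voted-on don't-care minterms.

    votes: list of (vote0, vote1) pairs, one per don't-care minterm.
    """
    biases = [b for b in (vote_bias(v0, v1) for v0, v1 in votes)
              if b is not None]
    return sum(biases) / len(biases)
```

A bias near 1.0 means one setting is strongly preferred; a bias near 0.5 would mean either value is equally good.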
The relationship between don’t-cares, power and fanout presents a challenge to the
glitch reduction algorithm. Fanout is closely related to interconnect capacitance, and
interconnect can represent 60% of total FPGA dynamic power, on average [Shan 02].
Fig. 3.7(a) shows logic signal power consumption versus fanout, averaged across all signals
in all circuits. Observe that, as expected, average signal power increases with fanout,
due to the increase in capacitance. We also examined, for each signal, the fraction of
minterms in its driving LUT that were don’t-cares, and averaged this across all signals
Figure 3.7: (a) Power per signal vs. fanout. (b) Normalized don't-cares per node vs. fanout.
Figure 3.8: Fanout splitting.
of a given fanout in all circuits. The results are shown in Fig. 3.7(b). While the results
are “noisy” for high fanout (due to a small sample size for such fanouts), we see that, in
general, high fanout signals have fewer don’t-cares in their driving LUTs than low fanout
signals. The rationale for this is that high fanout signals are more likely to be used by
at least one of their fanouts, decreasing ODCs for such signals. Essentially, we have two
competing trends in that it is desirable to reduce the power of high fanout signals (as
they consume significant power), yet such signals exhibit fewer don’t-care opportunities.
Figure 3.9: Stratix III adaptive logic module (ALM) [Alteb].
3.4.1 Fanout Splitting
Based on the trend of high-fanout signals having fewer don’t-cares, it seemed reasonable
to examine this as a potential area for improvement. Consider a LUT f1 with fanout
LUTs FO1...FOn. Suppose that LUTs FO1...FOn−1 do not care about the value of f1
when its input is x, but FOn does care about it. Then x is a care for f1, thereby reducing
the amount of don’t-care optimization opportunities, even though only one of its fanouts
uses it.
A possible solution to this problem is to duplicate LUT f1, creating f2, and trans-
ferring fanout FOn from f1 to f2. This would increase the amount of don’t-cares on f1,
since x would now be a don’t-care. In general, f1 can be split into two LUTs, f ′1 and
f2 (i.e. we redistribute the fanout of f1, moving some of its fanout to f2). Each LUT
now has more don’t-care opportunities, since the cares “generated” by fanouts of f ′1 are
no longer present in f2, and vice versa. An example is given in Fig. 3.8. The LUT f1
has four fanouts which have care set 1 (illustrated by the hatch marks as a subset of the
truth table). In other words, if no other fanouts existed besides those four, the overall
care set of f1 would be care set 1. The fifth fanout has care set 2. The overall care set
of f1 is the union of these care sets.
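The care-set arithmetic behind fanout splitting can be sketched as set operations; the representation of care sets as Python sets of minterm indices is illustrative:

```python
def care_set(fanout_care_sets):
    """A LUT's overall care set is the union of the care sets that
    its fanouts impose on it; everything else is a don't-care."""
    result = set()
    for cs in fanout_care_sets:
        result |= cs
    return result

def split_fanout(fanout_care_sets, group_a, group_b):
    """Duplicate the LUT and partition its fanouts into two groups
    (given as index lists into fanout_care_sets). Each copy then only
    needs to honour the cares imposed by its own group, so each copy
    has a smaller care set than the original."""
    cs_a = care_set([fanout_care_sets[i] for i in group_a])
    cs_b = care_set([fanout_care_sets[i] for i in group_b])
    return cs_a, cs_b
```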
By splitting the fanout of f1 among two new LUTs, f ′1 and f2, we can create two
Figure 3.10: Dynamic power reduction from fanout splitting.
LUTs with smaller care sets and therefore more don't-care optimization opportunities.
However, this incurs a power cost in duplicating the LUT and some routing resources.
The fanin routing would have to be duplicated for the new LUT. This would add to
the capacitance of these fanin signals. Fortunately, the Stratix III architecture [Alteb]
provides us with a way to mitigate this cost. The Adaptive Logic Module (ALM) shown
in Fig. 3.9 is essentially a pairing of two LUTs. By co-locating f1 and f2 in the same
ALM, we can virtually eliminate the cost of routing to an entirely new LUT. This is
because the routing to one LUT is shared with the routing to the other. This is a special
opportunity offered by the Stratix III architecture.
Figure 3.10 shows the dynamic power reduction resulting from fanout splitting. Some
circuits could not be placed and routed after fanout splitting due to illegal placement
constraints: the two LUTs in an ALM share a limited set of input signals, so pairing
certain LUTs into a single ALM violates their compatibility requirements. Unfortunately,
the possible power reduction
is quite low, aside from a 5% reduction on alu4. Several circuits even show an increase
in power. This is due to the extra LUT that must be used, as well as its associated
routing resources. Considering the tradeoff of saving the occasional glitch transition
versus the overhead of adding more logic and routing resources, the fanout splitting is
rarely beneficial. Therefore, we decided not to further pursue fanout splitting.
3.5 Conclusion
In this chapter, we presented an analysis of glitch power in FPGAs and a method for glitch
reduction using don’t-cares in logic synthesis. We showed that glitch power is a significant
portion of total power, and that there exist ample opportunities for don’t-care-based
optimizations. A novel glitch reduction technique was presented that sets don’t-cares in
FPGA configuration bits in order to avoid glitch transitions. This method is performed
after placement and routing, and has no effect on circuit area or performance. The
algorithm was evaluated with a commercial 65nm FPGA architecture using a commercial
tool flow. The algorithm achieved an average total dynamic power reduction of 4.0%,
with a peak reduction of 12.5%; glitch power was reduced by up to 49.0%, and 13.7% on
average.
Chapter 4
FPGA CAD Algorithm Noise
4.1 Introduction
The process of designing a circuit for an FPGA platform generally involves writing code
in a hardware description language such as Verilog or VHDL, then compiling the code
to a bitstream that will be programmed onto the FPGA. This compilation process is
broken into a series of CAD stages. Because the underlying problems are computationally
hard, the CAD algorithms rely on heuristics to solve them.
CAD algorithms commonly encounter situations where a choice must be made be-
tween two or more alternatives that appear to have the same quality. For example, a
logic function might be implemented in multiple ways, each having the same local cost
in terms of area, delay, power, or some other metric. However, the choice of how that
function is implemented may have an unknown global effect on the quality of the circuit.
In practice, the choice may be arbitrarily made (e.g. always select the first alternative)
or it may be controlled with the use of a random number generator. By running the
algorithm multiple times using different seeds for the random number generator, we can
obtain a set of circuits with different characteristics. The variation in the quality of these
circuits (area, performance, power) through seemingly neutral changes is what we will
call noise. It is interesting to note that noise places a limit on the prediction accuracy of
any timing/power estimation tools that are used prior to a noise-containing CAD algorithm.
One of the goals of this work is to quantify the amount of noise present in several
CAD algorithms.
The practice of trying multiple seeds, or “seed sweeping” is well-established for place-
ment and routing [Altea]. However, it is also possible for noise to be found in the logic
synthesis and technology mapping stages of the CAD flow. By exposing the noise in these
earlier stages, we hope to allow seed sweeping to take place earlier, in less time-consuming
stages.
In this and the following chapters, three questions are addressed:
• Where in the CAD flow does noise come from?
• How much noise exists in the various stages of the compilation flow?
• Is there a way to predict the best circuits from a group of candidates, in the presence
of noise?
In this chapter, we examine several CAD algorithms in the logic synthesis and tech-
nology mapping stages and expose noise in those algorithms. To our knowledge, there
is no prior work studying noise in these algorithms, nor is there existing work on power
noise in FPGAs (variations in dynamic power consumption due to CAD algorithm noise).
Section 4.2 presents background on the particular CAD algorithms to be studied, and
Section 4.3 describes the methodology for evaluating the amount of noise present in a
set of benchmark circuits. Section 4.4 shows
the performance and power results before place-and-route, while Section 4.5 shows the
results after place-and-route. Section 4.6 summarizes the chapter.
4.2 CAD Flow Stages
A typical FPGA CAD flow is shown in Fig. 4.1. It consists of the following steps:
Figure 4.1: FPGA CAD flow.
• Logic Synthesis: The logic functions needed to implement the circuit are derived
and optimized. We will be exploring new ways to inject noise into this stage.
• Technology Mapping: The logic functions are mapped into the logic elements
specific to the target device architecture. We will investigate noise in this stage as
well.
• Packing: The logic elements are grouped into larger units corresponding to the
target device architecture. We do not introduce noise in this stage, because we
are using commercial tools to perform packing and have no way to modify the
algorithm.
• Placement and routing: The mapped logic elements are placed into physical
locations on the target device, and the proper connections are made between the
logic elements using the programmable routing network. The presence of noise in
this stage has already been established, but it will still be considered in this work.
One of the few works to consider noise in FPGA CAD [Rubi 11] examines the
amount of delay noise in the routing stage of VPR [Betz 97]. The authors in-
voke randomness in the PathFinder routing algorithm by changing the order of
nets routed and making small perturbations in circuits. One experiment involves
changing the routing architecture to include some slightly faster wires such that the
maximum impact to critical path delay should be 0.5%. However, this modifica-
tion was experimentally shown to cause changes of -34% to +15%. This work also
proposes a technique to reduce this noise through delay-targeted routing, which
calculates the criticality of a route using a fixed delay target rather than a floating
one.
The work presented here focuses on the logic synthesis and technology mapping stages.
In particular, we use the algorithms implemented in the academic tool ABC [Berk 06].
These algorithms are explained in further detail below, as well as our new methods for
injecting noise into each of them.
4.2.1 Logic Synthesis
The algorithms studied in this stage act on an And-Inverter Graph (AIG) which is a
representation of a logic circuit using only two-input AND gates and inverters. This
is the primary data structure used by ABC. An example is shown in Fig. 4.2. The
large circles (nodes) represent AND gates, while the small dots on the edges represent
inversion. This example shows the function ¬x1 ∧ ((x2 ∨ x3) ∧ (x4 ∧ x5)).
The general goal of algorithms in this stage is to reduce the number of nodes in the
AIG and the number of logic levels, which is the maximum number of nodes from a
combinational input to a combinational output. In the example of Fig. 4.2, there are
Figure 4.2: Example of an And-Inverter Graph (AIG).
four nodes and three levels.
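A minimal Python encoding of this AIG makes the node and level counts concrete. Expressing the OR as an AND with inverted inputs (and an inverted output edge) is one possible encoding of the example; whether Fig. 4.2 places its inverters exactly this way is not guaranteed:

```python
class Node:
    """AIG node: a two-input AND; each fanin edge may be inverted."""
    def __init__(self, a, a_inv, b, b_inv):
        self.a, self.a_inv, self.b, self.b_inv = a, a_inv, b, b_inv

def level(n):
    """Logic level: primary inputs (strings) are level 0; each AND adds one."""
    if isinstance(n, str):
        return 0
    return 1 + max(level(n.a), level(n.b))

# not-x1 AND ((x2 OR x3) AND (x4 AND x5)); the OR is built from an AND
# with both inputs inverted, whose output edge is read inverted.
n_or  = Node("x2", True, "x3", True)    # x2 OR x3 (as inverted AND)
n_and = Node("x4", False, "x5", False)  # x4 AND x5
n_mid = Node(n_or, True, n_and, False)  # (x2 OR x3) AND (x4 AND x5)
root  = Node("x1", True, n_mid, False)  # not-x1 AND (...)
```

Counting the four `Node` objects and evaluating `level(root)` reproduces the four nodes and three levels of the example.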
Three potential noise sources were identified in the logic synthesis stage:
• AIG balancing: And-Inverter Graph balancing is a technique that aims to reduce
the number of levels in an AIG [Mish 11]. An example of this is shown in Figs. 4.3
and 4.4. In Fig. 4.3, we see in the ellipse an AIG representing a 5-input AND. This
subgraph of the AIG has a depth of 4. In Fig. 4.4, we see two examples of AIGs
that could be generated by AIG balancing. In both cases, the number of levels is
reduced to 3. However, the balancing can be done in multiple ways, by placing
different signals on the shallower inputs. Balancing is done in two main steps:
– Tree covering: This step identifies multi-input AND gates in the AIG by
grouping together nodes which are not inverted and have no external fanout.
Figure 4.3: Example of an AIG before balancing (logic levels shown in parentheses).
An example is shown by the ellipse in Fig. 4.3. The tree cannot be expanded
to include x4 as it is inverted, and it cannot include x5 as it has another
fanout.
– Tree balancing: For each multi-input AND gate identified by the tree cov-
ering stage, the tree balancing stage decomposes it into a balanced tree of
two-input AND gates. The balancing is done considering the logic levels of
the nodes feeding the multi-input AND. The process is shown in Algorithm 2.
The algorithm essentially pairs nodes together until the tree is formed. It
begins by taking the lowest level node as the first one to be paired (line 3).
It then finds the nodes with the next lowest level, between the indices of
leftBound and rightBound (lines 5-9).
Figure 4.4: Examples of balanced AIGs.
At this point, the ABC code makes an arbitrary selection between the nodes
(selecting in the same way every time). However, we change the algorithm to
select one of these nodes randomly (line 11). The rand(m,n) function gives a
random integer between m and n (inclusive). Finally, this node is paired with
the first one into a two-input AND gate, replacing the original nodes. This
process continues until the last two nodes are paired. Consider the example in
Fig. 4.3 where the logic levels are in parentheses. Two nodes with the lowest
level (3) are randomly chosen and paired into another node with level 4. The
pairing might proceed as follows (Fig. 4.4, left):
∗ Begin by sorting input nodes in descending order by level
x1(4), x2(4), x3(3), x4(3), x5(3)
∗ Randomly choose two nodes with the lowest level (x4(3) and x5(3)) and
combine them into a new node, x45(4)
x1(4), x2(4), x45(4), x3(3)
Figure 4.5: Example of AIG rewriting.
∗ Combine x3(3) with a random level 4 node, x45(4), to form x345(5)
x345(5), x1(4), x2(4)
∗ Combine two level 4 nodes x1(4) and x2(4) into x12(5)
x12(5), x345(5)
∗ Combine final two nodes to complete balancing
x12345(6)
Alternatively, the random pairing may be done this way (Fig. 4.4, right):
∗ x1(4), x2(4), x3(3), x4(3), x5(3)
∗ Select x3(3) and x5(3) randomly (instead of x4(3) and x5(3) as before)
x1(4), x2(4), x35(4), x4(3)
∗ Combine x4(3) with a random level 4 node, x1(4), to form x14(5)
x14(5), x2(4), x35(4)
∗ Combine two level 4 nodes x2(4) and x35(4) into x235(5)
x14(5), x235(5)
∗ Combine final two nodes to complete balancing
x12345(6)
Algorithm 2 Tree balancing algorithm.
Input: a vector V of input nodes to the multi-input AND gate, sorted by decreasing level
Output: a balanced AIG
1: while size(V) > 1 do
2:   {1. Get the node with minimum level}
3:   node1 ← V[size(V) − 1]
4:   {2. Identify the nodes with the next lowest level (between leftBound and rightBound)}
5:   rightBound ← size(V) − 2
6:   leftBound ← rightBound
7:   while leftBound ≥ 0 and level(V[leftBound − 1]) = level(V[rightBound]) do
8:     leftBound ← leftBound − 1
9:   end while
10:  {3. Select a node randomly (NEW)}
11:  node2 ← V[rand(leftBound, rightBound)]
12:  {4. Pair the nodes}
13:  newNode ← AND(node1, node2)
14:  remove(V, node1, node2)
15:  insert(V, newNode)
16: end while
17: return newNode
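The tree balancing procedure can be rendered as runnable Python. The level bookkeeping is simplified (a new AND sits one level above its deeper input), `rand` becomes `random.randint`, and the vector is re-sorted after each insertion rather than maintained in place; this is a sketch of the same pairing logic, not ABC's implementation:

```python
import random

def balance(nodes):
    """Pair the inputs of a multi-input AND into a balanced two-input tree.

    nodes: list of (name, level) pairs. Returns the (name, level) of the
    final root. Ties among candidates for the second operand are broken
    randomly -- the injected noise source of the tree balancing stage.
    """
    V = sorted(nodes, key=lambda x: -x[1])   # decreasing level
    while len(V) > 1:
        name1, lev1 = V.pop()                # node with minimum level
        right = len(V) - 1                   # last remaining index
        left = right
        while left > 0 and V[left - 1][1] == V[right][1]:
            left -= 1                        # extend over the tied group
        idx = random.randint(left, right)    # random selection (NEW)
        name2, lev2 = V.pop(idx)
        V.append(("(%s&%s)" % (name1, name2), max(lev1, lev2) + 1))
        V.sort(key=lambda x: -x[1])
    return V[0]
```

On the example of Fig. 4.3, any random pairing of x1(4), x2(4), x3(3), x4(3), x5(3) yields a root at level 6, matching both orderings shown above.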
This shows that the tree balancing stage does not provide a unique solution,
and is therefore a source of noise.
• AIG rewriting: AIG rewriting is an algorithm that reduces the number of nodes/logic
levels in an AIG by examining subgraphs of nodes and replacing them with lower-
cost substitutes [Mish 06b]. An example is shown in Fig. 4.5. In this case, the
AIG subgraph represents a 3-input AND. Using rewriting, it can be reduced from
3 nodes to 2. Algorithm 3 loops through each node n in the AIG (line 1) and enu-
merates all 4-input cuts of n (line 2). A cut of a node n is a set of nodes (leaves)
such that each path from a primary input to n passes through at least one node
of the cut. Each cut is replaced with equivalent AIG subgraphs from a hash table
of precomputed subgraphs (line 5). If the subgraph leads to a reduction in AIG
nodes, it is kept.
To add randomness at this stage, we modify the algorithm to allow changes even
when the replacement leads to no change in the AIG node count. It is kept with
a 50% probability (line 8). These are known as “zero-cost” replacements. If a
Algorithm 3 AIG rewriting algorithm.
Input: an AIG and a hash table of precomputed subgraphs S
Output: a rewritten AIG
1: for each node n of the AIG in topological order do
2:   for each 4-input cut c of n do
3:     bestGain ← −1
4:     bestS ← NULL
5:     for each possible rewriting option s from HashLookup(S, c) do
6:       gain ← SavedNodes(s, c) − AddedNodes(s, c)
7:       {If zero-cost, keep the change with 50% probability (NEW).}
8:       if gain > 0 or (gain = 0 and rand(0, 1)) then
9:         if bestS = NULL or gain ≥ bestGain then
10:          bestGain ← gain
11:          bestS ← s
12:        end if
13:      end if
14:    end for
15:    if bestS ≠ NULL then
16:      Update(AIG, bestS)
17:    end if
18:  end for
19: end for
new subgraph is found (leading to a cost reduction or zero-cost change), the AIG is
updated (line 16). The zero-cost replacements are a source of noise for the rewriting
algorithm.
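The inner selection loop with the zero-cost randomization can be sketched as follows; the (gain, subgraph) pair representation is illustrative:

```python
import random

def choose_rewrite(options):
    """Pick the best rewriting option for a cut.

    options: list of (gain, subgraph) pairs, where gain is saved minus
    added AIG nodes. Positive-gain options are always eligible; each
    zero-cost option is eligible with 50% probability (the injected
    noise source). Returns None if nothing is kept.
    """
    best_gain, best_s = -1, None
    for gain, s in options:
        if gain > 0 or (gain == 0 and random.random() < 0.5):
            if best_s is None or gain >= best_gain:
                best_gain, best_s = gain, s
    return best_s
```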
• AIG refactoring: This technique involves computing one large cut for each AIG
node, then replacing it with a factored form with fewer nodes [Mish 06a]. The cuts
are chosen based on how much reconvergence they contain, which is an indicator of
redundancy that can be exploited by refactoring. Refactoring differs from rewriting
in that it acts on a larger scale (by default, rewriting is done on 4-input cuts, while
refactoring can go as high as 16). The noise injection in this stage is similar to
the method used in AIG rewriting. New AIG subgraphs are generated, and the
replacements are made with a 50% probability if the new subgraphs result in a
zero-cost change.
The above algorithms are repeated in sequence several times as part of the ABC script
resyn2. This lets the algorithms create new optimization opportunities for each other.
Algorithm 4 Cut comparison algorithm.
Input: two cuts, c1 and c2
Output: 1 if c1 is better, −1 if c2 is better
1: if metric1(c1) > metric1(c2) then
2:   return 1
3: end if
4: if metric1(c2) > metric1(c1) then
5:   return −1
6: end if
7: {... repeat for all metrics ...}
8: if metricn(c1) > metricn(c2) then
9:   return 1
10: end if
11: if metricn(c2) > metricn(c1) then
12:  return −1
13: end if
14: {If still tied, decide order randomly (NEW).}
15: if rand(0, 1) then
16:  return 1
17: else
18:  return −1
19: end if
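The cut comparison can be written as a generic Python comparator. For simplicity the sketch assumes every metric is larger-is-better, whereas the real sorter mixes larger- and smaller-is-better criteria (e.g. depth and cut size are minimized):

```python
import random

def compare_cuts(c1, c2, metrics):
    """Return 1 if c1 is better, -1 if c2 is better.

    metrics: ordered list of functions scoring a cut, compared in
    priority order. A tie on every metric is broken randomly -- the
    injected noise source in the technology mapping stage.
    """
    for m in metrics:
        if m(c1) > m(c2):
            return 1
        if m(c2) > m(c1):
            return -1
    return 1 if random.random() < 0.5 else -1  # random tie-break (NEW)
```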
4.2.2 Technology Mapping
We use the priority cut-based technology mapping algorithm in ABC [Mish 07]. The
goal of this stage is to map the logic of the AIG to K-input functions which can be
implemented by LUTs on the FPGA (K depends on the FPGA architecture). The
mapper does this by first evaluating a set of priority cuts for each node in an AIG. These
cuts represent potential LUT implementations of that node. The cuts are selected and
sorted in terms of delay, number of inputs, and area. At this point in the CAD flow,
logic depth is used as a proxy for delay, and the number of LUTs is used as a proxy for
area.
The priority cuts for each node are sorted by several criteria, depending on the map-
ping parameters. These criteria include depth and cut size. The random noise in this
stage comes from deciding between cuts with the same values for each of these metrics.
Algorithm 4 shows the cut comparison function used when sorting priority cuts. It begins
by comparing the two cuts for each metric in order. If the cuts are tied on all metrics,
our modification to the algorithm makes a random selection between them.
The mapping algorithm makes several passes over the netlist, stitching together the
best results (using depth-optimal mappings on critical paths, and area-oriented mappings
elsewhere). We introduce noise in two stages:
• Depth-oriented mapping: Here, the logic depth metric is prioritized over area.
• Area-oriented mapping: Area is prioritized over depth.
4.2.3 Placement and Routing
The placement problem deals with assigning physical locations to each of the logic
blocks in a circuit. A common technique for placement is simulated annealing [Kirk 83].
This algorithm mimics the annealing of metals, a process in which a material is heated,
then cooled in order for its atoms to settle into a low-energy configuration. In placement,
the “atoms” are logic blocks, which are moved around randomly. The random moves are
controlled by the current “temperature” of the anneal, which determines the likelihood
of accepting moves even when they reduce the current quality of the placement. This
“hill-climbing” quality allows the algorithm to avoid being stuck in a local minimum of
the solution space.
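The hill-climbing acceptance test at the heart of simulated annealing can be sketched as below. This is the standard Metropolis-style formulation [Kirk 83], not necessarily the exact rule used inside Quartus' placer:

```python
import math
import random

def accept_move(delta_cost, temperature):
    """Acceptance test for one proposed block move.

    Improving moves (delta_cost <= 0) are always accepted; worsening
    moves are accepted with probability exp(-delta_cost / T), which
    shrinks as the anneal cools. This is what lets the placer climb
    out of local minima early in the anneal.
    """
    if delta_cost <= 0:
        return True
    return random.random() < math.exp(-delta_cost / temperature)
```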
The placement and routing stages of the CAD flow are done using Quartus II 10.1, a
commercial CAD tool from Altera. Since this is a commercial tool, we cannot implement
our own noise injection method. Instead, the noise injection in this stage is simply a
matter of changing the seed option to Quartus’ place-and-route tool. Although place-
and-route noise is not the focus of this work, it is still important to consider noise in this
stage because it can mask the effects of noise in the previous stages. For example, a good
placement (due to noise) might hide the negative effects of a mapping with inherently
bad quality, or vice versa. Therefore, it is necessary to evaluate the noise in this stage
and try to separate it from the noise in the previous stages.
4.3 Methodology
In order to evaluate noise over a large number of random seeds, a large number of circuit
compilations were performed (over 10,000). To do so, experiments were conducted on the
SciNet high-performance computing system [Loke 10], which allowed many compilations to
run in parallel and thousands of compiles to finish within a few days.
The tools used are the same as the ones used in the glitch power analysis section. We
use Altera’s Quartus 10.1 to pack, place and route the circuits and perform timing and
power analysis. The target FPGA family is Altera’s Stratix III. Modelsim 6.3e is used
for simulation to get toggle rates for the power estimation. 5000 random input vectors
are applied to each circuit.
The first benchmark set consists of 20 MCNC circuits. A second set of 7 benchmarks
was taken from the VPR 5.0 benchmark set, in order to have data for some larger circuits.
The circuits were selected by removing ones that were too similar to others (e.g. two
FIR filters with different parameters), and removing those which did not show significant
toggling on nets when subjected to random vector simulation (some circuits may require
specific input patterns to become active). This was done in order to get meaningful
dynamic power data.
4.4 Noise Measurement: Before Place and Route
In this section, the word “design” will be used to refer to all circuits having the same
original source file (e.g. “alu4” is a design). A “circuit” will refer to a particular
compilation of the design using certain seeds (e.g. “alu4” compiled with synthesis seed 1
and mapping seed 2 is a circuit). Six noise injection experiments are presented: one for
each of five noise injection stages tested individually, as well as one experiment containing
all noise injection stages. For each experiment, each of the 27 designs was processed by
ABC using 25 different seeds (making 25 ∗ 27 = 675 circuits). The results of the noise
Figure 4.6: Number of circuits vs. normalized nodes/level (balancing noise).
Figure 4.7: Number of circuits vs. normalized nodes/level (rewriting noise).
Figure 4.8: Number of circuits vs. normalized nodes/level (refactoring noise).
Figure 4.9: Number of circuits vs. normalized nodes/level (depth-oriented mapping noise).
Figure 4.10: Number of circuits vs. normalized nodes/level (area-oriented mapping noise).
Figure 4.11: Number of circuits vs. normalized nodes/level (all noise).
injection experiments are as follows:
1. AIG balancing (Fig. 4.6)
2. AIG rewriting (Fig. 4.7)
3. AIG refactoring (Fig. 4.8)
4. Technology mapping - depth-oriented (Fig. 4.9)
5. Technology mapping - area-oriented (Fig. 4.10)
6. All of the above (Fig. 4.11)
Each graph shows a histogram of the noise distribution for that stage. The x-axis
shows the normalized number of nodes (AIG nodes for unmapped circuits, LUTs for
mapped circuits) and the normalized number of logic levels. The results are normalized
to the average for each design. The y-axis shows the number of circuits in each bin (a
bin contains circuits falling to the left of its label). All graphs are shown with the same
scale to facilitate comparison.
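The per-design normalization used for these histograms is simply each circuit's metric divided by the average over all compilations of the same design, so that designs of very different sizes can share one plot:

```python
def normalize_to_design_mean(values):
    """Normalize one design's metric (node count or level count) over
    its set of compilations to that design's own average, so a value
    of 1.0 means 'average for this design'."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]
```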
From inspection, the circuit metrics appear to follow a normal distribution. The
synthesis stages (Figs. 4.6, 4.7, 4.8: balancing, rewriting, refactoring) tend to show wider
distributions. Note that the outliers in level count are generally due to low numbers
of logic levels (relative to the number of nodes in the circuit). It was observed that
any deviations in logic depth are limited to a single level. The node count distributions
are generally smoother. In the balancing stage, the majority of circuits appear to be
contained within +/- 1.0% of the mean (i.e. between 0.99 and 1.01), while the rewriting
and refactoring stages are tighter – around +/- 0.6% of the mean.
In contrast, noise in technology mapping (Figs. 4.9, 4.10) is much less than in syn-
thesis. The noise distributions are far narrower, showing that most circuits are within
0.2% of the mean in terms of node count and levels. When all noise injection stages are
Table 4.1: Standard deviation of noise (before place-and-route)
Noise injection | Node stdev. | Level stdev.
Bibliography
[Alteb] Altera. “Stratix III device handbook”. http://www.altera.com/literature/lit-stx3.jsp.
[Altec] Altera. “Stratix V device handbook”. http://www.altera.com/literature/lit-stratix-v.jsp.
[Ande 06] J. Anderson and F. Najm. “Active leakage power optimization for FPGAs”. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25, No. 3, pp. 423–437, March 2006.
[Berk 06] Berkeley Logic Synthesis and Verification Group. “ABC: A system for sequential synthesis and verification”. Release 00406. http://www.eecs.berkeley.edu/~alanmi/abc/.
[Betz 97] V. Betz and J. Rose. “VPR: A new packing, placement and routing tool for FPGA research”. In: Proceedings of the 7th International Workshop on Field-Programmable Logic and Applications, pp. 213–222, Springer-Verlag, London, UK, 1997.
[Chen 07a] D. Chen, J. Cong, Y. Fan, and Z. Zhang. “High-level power estimation and low-power design space exploration for FPGAs”. In: Asia and South Pacific Design Automation Conference (ASP-DAC ’07), pp. 529–534, Jan. 2007.
[Chen 07b] L. Cheng, D. Chen, and M. Wong. “GlitchMap: An FPGA technology mapper for low power considering glitches”. In: 44th ACM/IEEE Design Automation Conference (DAC ’07), pp. 318–323, 2007.
[Czaj 07] T. S. Czajkowski and S. D. Brown. “Using negative edge triggered FFs to reduce glitching power in FPGA circuits”. In: Proceedings of the 44th Annual Design Automation Conference, pp. 324–329, ACM, New York, NY, USA, 2007.
[Dinh 09] Q. Dinh, D. Chen, and M. D. Wong. “A routing approach to reduce glitches in low power FPGAs”. In: Proceedings of the 2009 International Symposium on Physical Design, pp. 99–106, ACM, New York, NY, USA, 2009.
81
Bibliography 82
[Fisc 05] R. Fischer, K. Buchenrieder, and U. Nageldinger. “Reducing the power con-sumption of FPGAs through retiming”. In: Engineering of Computer-BasedSystems, 2005. ECBS ’05. 12th IEEE International Conference and Work-shops on the, pp. 89 – 94, Apr. 2005.
[Fran 10] S. Franssila. Introduction to Microfabrication. John Wiley & Sons, 2010.
[Gort 10] M. Gort and J. Anderson. “Deterministic multi-core parallel routing forFPGAs”. In: Field-Programmable Technology (FPT), 2010 InternationalConference on, pp. 78 –86, Dec. 2010.
[Kahn 05] A. Kahng and S. Reda. “Intrinsic shortest path length: a new, accurate apriori wirelength estimator”. In: Computer-Aided Design, 2005. ICCAD-2005. IEEE/ACM International Conference on, pp. 173 – 180, Nov. 2005.
[Kirk 83] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. “Optimization by simulatedannealing”. Science, Vol. 220, No. 4598, pp. 671–680, 1983.
[Kuon 07] I. Kuon and J. Rose. “Measuring the gap between FPGAs and ASICs”.Computer-Aided Design of Integrated Circuits and Systems, IEEE Transac-tions on, Vol. 26, No. 2, pp. 203 –215, Feb. 2007.
[Lamo 08] J. Lamoureux, G. Lemieux, and S. Wilton. “GlitchLess: Dynamic powerminimization in FPGAs through edge alignment and glitch filtering”. VeryLarge Scale Integration (VLSI) Systems, IEEE Transactions on, Vol. 16,No. 11, pp. 1521 –1534, Nov. 2008.
[Lim 05] H. Lim, K. Lee, Y. Cho, and N. Chang. “Flip-flop insertion with shifted-phase clocks for FPGA power reduction”. In: ICCAD ’05: Proceedings ofthe 2005 IEEE/ACM International conference on Computer-aided design,pp. 335–342, IEEE Computer Society, Washington, DC, USA, 2005.
[Lin 95] B. Lin and S. Devadas. “Synthesis of hazard-free multilevel logic undermultiple-input changes from binary decision diagrams”. Computer-AidedDesign of Integrated Circuits and Systems, IEEE Transactions on, Vol. 14,No. 8, pp. 974 –985, Aug. 1995.
[Liu 04] Q. Liu and M. Marek-Sadowska. “Pre-layout wire length and congestionestimation”. In: Design Automation Conference, 2004. Proceedings. 41st,pp. 582 –587, Jul. 2004.
[Liu 05] Q. Liu and M. Marek-Sadowska. “Pre-layout physical connectivity predictionwith application in clustering-based placement”. In: Computer Design: VLSIin Computers and Processors, 2005. ICCD 2005. Proceedings. 2005 IEEEInternational Conference on, pp. 31 – 37, Oct. 2005.
[Loke 10] C. Loken, D. Gruner, L. Groer, R. Peltier, N. Bunn, M. Craig, T. Henriques,J. Dempsey, C.-H. Yu, J. Chen, L. J. Dursi, J. Chong, S. Northrup, J. Pinto,
Bibliography 83
N. Knecht, and R. V. Zon. “SciNet: Lessons learned from building a power-efficient Top-20 system and data centre”. Journal of Physics: ConferenceSeries, Vol. 256, No. 1, p. 012026, 2010.
[Mano 07] V. Manohararajah, G. Chiu, D. Singh, and S. Brown. “Predicting intercon-nect delay for physical synthesis in a FPGA CAD flow”. Very Large ScaleIntegration (VLSI) Systems, IEEE Transactions on, Vol. 15, No. 8, pp. 895–903, Aug. 2007.
[Mish 05] A. Mishchenko and R. Brayton. “SAT-based complete don’t-care compu-tation for network optimization”. In: ACM/IEEE Design Automation andTest Conference, pp. 412–417, 2005.
[Mish 06a] A. Mishchenko and R. Brayton. “Scalable logic synthesis using a simplecircuit structure”. In: Proc. International Workshop on Logic and Synthesis,pp. 15–22, 2006.
[Mish 06b] A. Mishchenko, S. Chatterjee, and R. Brayton. “DAG-aware AIG rewrit-ing: a fresh look at combinational logic synthesis”. In: Design AutomationConference, 2006 43rd ACM/IEEE, pp. 532 –535, 2006.
[Mish 06c] A. Mishchenko, S. Chatterjee, R. Brayton, and N. Een. “Improvementsto combinational equivalence checking”. In: Proceedings of the 2006IEEE/ACM international conference on Computer-aided design, pp. 836–843, ACM, New York, NY, USA, 2006.
[Mish 07] A. Mishchenko, S. Cho, S. Chatterjee, and R. Brayton. “Combinational andsequential mapping with priority cuts”. In: Computer-Aided Design, 2007.ICCAD 2007. IEEE/ACM International Conference on, pp. 354 –361, Nov.2007.
[Mish 09] A. Mishchenko, R. Brayton, J.-H. R. Jiang, and S. Jang. “Scalable don’t-care-based logic optimization and resynthesis”. In: Proceedings of theACM/SIGDA international symposium on Field programmable gate arrays,pp. 151–160, ACM, New York, NY, USA, 2009.
[Mish 11] A. Mishchenko, R. Brayton, S. Jang, and V. Kravets. “Delay optimizationusing SOP balancing”. In: Proc. International Workshop on Logic and Syn-thesis, pp. 75–82, 2011.
[Pand 07] A. Pandit and A. Akoglu. “Wirelength prediction for FPGAs”. In: FieldProgrammable Logic and Applications, 2007. FPL 2007. International Con-ference on, pp. 749 –752, Aug. 2007.
[Rubi 11] R. Y. Rubin and A. M. DeHon. “Timing-driven pathfinder pathology andremediation: quantifying and reducing delay noise in VPR-pathfinder”. In:Proceedings of the 19th ACM/SIGDA international symposium on Field pro-grammable gate arrays, pp. 173–176, ACM, New York, NY, USA, 2011.
Bibliography 84
[Shan 02] L. Shang, A. S. Kaviani, and K. Bathala. “Dynamic power consumption inVirtex-II FPGA family”. In: Proceedings of the 2002 ACM/SIGDA tenthinternational symposium on Field-programmable gate arrays, pp. 157–164,ACM, New York, NY, USA, 2002.
[Shum 11] W. Shum and J. H. Anderson. “FPGA glitch power analysis and reduction”.In: Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design, pp. 27–32, IEEE Press, Piscataway, NJ, USA,2011.
[Sing 05] D. Singh, V. Manohararajah, and S. Brown. “Two-stage physical synthesisfor FPGAs”. In: Custom Integrated Circuits Conference, 2005. Proceedingsof the IEEE 2005, pp. 171 – 178, Sept. 2005.
[Wilt 04] S. J. Wilton, S.-S. Ang, and W. Luk. “The impact of pipelining on energyper operation in field-programmable gate arrays”. In: Proc. Intl. Conf. onField-Programmable Logic and its Applications, pp. 719–728, 2004.
[Xili] Xilinx. “7 Series FPGAs overview”. http://www.xilinx.com/support/