INTERMEDIATE FABRICS: LOW-OVERHEAD COARSE-GRAINED VIRTUAL RECONFIGURABLE FABRICS TO ENABLE FAST PLACE AND ROUTE

By

AARON LANDY

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2013
© 2013 Aaron Landy
ACKNOWLEDGMENTS
I thank the chair and members of my supervisory committee for their mentoring
and time, the University of Florida Graduate School, the National Science Foundation,
and the NSF Center for High Performance Reconfigurable Computing (CHREC) for their
generous support. I thank my parents for their many years of loving encouragement, and
as resources). Intermediate fabrics can also implement LUT-based architectures, but
instead are usually specialized for specific domains and even individual applications
using a resource granularity uncommon to FPGAs, which provides fast place-and-route.
Previous virtual FPGAs can be viewed as specific, low-level instances of an intermediate
fabric. One key difference is that because intermediate fabrics can be specialized,
interconnect requirements differ from fine-grained virtual FPGAs, and also vary
between specializations. Numerous previous studies have introduced reconfigurable,
coarse-grained physical devices for different application domains [5] [10] [15] [22]
[24] [34] [40] [41] [43]. Although those devices provide good performance for their
targeted applications, the disadvantage of such an approach is that specialized physical
devices generally have high costs due to limited economy of scale. Intermediate
fabrics can provide the same architectures implemented virtually atop common
commercial-off-the-shelf FPGAs, which has significant cost advantages and an
acceptable overhead for some use cases. Several studies have also considered virtual
coarse-grained architectures for specific domains [41] [45]. These approaches are
complementary and represent individual instances of intermediate fabrics.
2.3 Constant Propagation
Many studies have shown that constant propagation can increase functional
density and performance [12] [13] [19] [20] [21] [23] [31]. While those techniques
are effective, synthesis must be able to statically identify constants. The presented
work enables these optimizations in cases where a constant value is not known at
compile time, and also when a value changes with low frequency. Previous studies have
demonstrated a concept similar to pseudo-constants by using partial reconfiguration
for run-time logic minimization [17] [23] [31] [32] [44] [46]. Previous work also showed
that partial reconfiguration can have prohibitive reconfiguration times, implementation
complexity, and limitations on reconfiguration granularity [14] [35] [46]. This past work
examined trade-offs between area and reconfiguration time when using run-time logic
optimization, and included the development of a functional density metric to quantify the
advantages. We extend past work by reducing reconfiguration times and implementation
complexity via the LUT-based RAM primitives provided by most FPGAs. Prior studies
have also used LUT RAM as dynamically reconfigurable logic. The FPGA overlay
network presented by Brant et al. [7] used LUT RAM to implement virtual LUTs in a
virtual FPGA fabric. That work also decreased multiplexer resources via an approach
similar to what we describe. We expand upon that work by generalizing pseudo-constant
logic optimization for potentially any logic function.
2.4 Intermediate Fabrics
In [11], Coole and Stitt introduce intermediate fabrics as a possible solution to
exceedingly long FPGA place-and-route times. They also propose fabric specialization
to address area overhead concerns. Using specialization, fabric overhead can be
reduced by including in the fabric only those resources essential to implement a given
application. This represents the lowest overhead achievable by early intermediate
fabrics, but pays a significant penalty in fabric reusability. The optimizations presented in
this work offer alternative approaches to overhead reduction without sacrificing fabric
reusability.
CHAPTER 3
INTERCONNECT ENHANCEMENTS
This chapter discusses enhancements made to the intermediate fabric interconnect
architecture to reduce area overhead while minimizing routability tradeoffs. The chapter
first provides an overview of the intermediate fabric architecture and the interconnect
used by initial intermediate fabric studies. It then details the optimized interconnect style
and finally compares area overhead and routability between the original and optimized
interconnect.
3.1 Intermediate Fabric Architecture
This section overviews intermediate fabrics in Section 3.1.1 and then discusses the
virtual interconnect architecture used by previous intermediate fabrics in Section 3.1.2.
3.1.1 Overview
As shown in Figure 3-1, an intermediate fabric is a virtual reconfigurable device,
implemented atop a physical FPGA, which implements circuits from HDL or high-level
code via synthesis, placement, and routing. Intermediate fabrics, like overlay networks
[26] and virtual FPGAs [7][33], provide a fabric capable of implementing numerous
circuits. However, unlike those techniques, intermediate fabrics tend to be specialized
for the requirements of a specific set of applications, while providing enough routability
to support similar applications or different functions in the same domain. The example
in Figure 3-1 illustrates an intermediate fabric specialized for a frequency-domain
signal-processing circuit, and provides corresponding floating-point resources for
FFTs and arithmetic computation. When directly compiling this circuit to an FPGA,
place-and-route is likely to require hours due to the compiler decomposing the circuit
into tens-of-thousands of LUTs. However, when targeting the intermediate fabric,
the compiler decomposes the circuit into several coarse-grained resources, which
reduces the place-and-route input size by orders of magnitude and provides 100x to
1000x place-and-route speedup [11][42]. A complete discussion of intermediate fabric
[Figure 3-1 illustration: an application circuit with floating-point operations passes through synthesis, place, and route, drawing a fabric from a fabric library, and is implemented on an intermediate fabric (IF) with floating-point resources (FFTs, multipliers, adders, dividers) atop the FPGA. Benefits noted in the figure: 1) fast compilation via abstraction (a few coarse-grained resources as opposed to 100k LUTs); 2) circuit portability across physical FPGAs.]

Figure 3-1. Intermediate fabrics (IFs) are virtual application-specialized fabrics implemented atop FPGAs that hide physical device complexity to achieve fast place-and-route and application portability.
usage models and their implementations is outside the scope of this paper; we instead
summarize two basic models. The library model provides a large, pre-implemented
set of intermediate fabrics that a designer or synthesis tool can choose from based on
the requirements of the application. For the example in Figure 3-1, a designer or tool
could choose the selected fabric from one of many fabrics that provide different fabric
sizes, different combinations of resources, different precisions, etc. An alternative is the
synthesis model, during which the synthesis tool creates a specialized fabric based on
the application requirements. The advantage to the synthesis model is reduced area
overhead. However, the disadvantage is that the application designer must wait for
place-and-route to implement the intermediate fabric on the physical FPGA. Although
such place-and-route may require hours, the compilation time is amortized over the
lifetime of the fabric because the physical place-and-route is only needed once.
[Figure 3-2 illustration: (a) an island-style layout of computational units (CUs), switch boxes (SBs), and connection boxes (CBs); (b) a bidirectional routing track whose sources and sinks are the east and west switch boxes and the north and south CUs; (c) the track's RTL implementation as a multiplexer whose select is driven by configuration bits.]

Figure 3-2. Previous intermediate fabric interconnect architecture, where (b) routing tracks between resources were implemented as (c) multiplexers based on the number of track sources.
3.1.2 Previous Interconnect Architecture
Figure 3-2(a) illustrates the basic island-style fabric used in previous intermediate
fabrics [11][42]. Such a fabric closely imitates the widely studied structure of physical
FPGAs consisting of switch boxes, connection boxes, and bidirectional routing tracks,
but replaces LUTs with application-specific resources (e.g., floating-point units, FFTs)
referred to as computational units (CUs). Note that because intermediate fabrics
can be specialized, the CUs and virtual routing tracks can potentially be any width.
For example, a fabric with floating-point CUs might provide 32-bit routing tracks.
Intermediate fabrics also contain specialized regions for control and memory operations.
However, in this paper, we focus on the areas of a circuit that contribute the most to long
place-and-route, which for many applications are coarse-grained, pipelined datapath
operations (e.g., FFTs).
The main limitation of previous intermediate fabrics is area overhead incurred
by implementing the virtual fabric atop a physical FPGA (i.e., synthesized VHDL for
the virtual fabric). Such overhead results from several sources. The largest source
of overhead comes from mux logic in the virtual interconnect. Previous intermediate
fabrics use virtual bidirectional routing tracks [11][42], whose register-transfer-level
(RTL) implementation is shown in Figure 3-2(b) and (c). For an m-bit track with n
possible sources, the RTL implementation uses an m-bit, n:1 mux, in some cases with
a register or latch on the mux output. For example, Figure 3-2(b) shows a common
configuration of a bidirectional track with four sources: two switch boxes and two CUs,
with the corresponding RTL implementation shown in Figure 3-2(c) as a 4:1 mux, with
a select value stored in a 2-bit virtual configuration register. Considering the large
number of tracks found in most fabrics, this mux-based implementation of virtual tracks
uses numerous LUT resources in the physical FPGA, and is responsible for over 50%
of the total LUT usage in many intermediate fabrics. Similarly, virtual switch boxes
and connection boxes implement various topologies using additional muxes between
virtual tracks. The exact percentage of LUT usage for switch/connection boxes varies
depending on the box topology and flexibility, but is also a significant contributor to
area overhead. When combining all interconnect resources (tracks, switch boxes, and
connection boxes), we determined that the virtual interconnect is commonly responsible
for over 90% of LUT requirements. In addition to the mux overhead, intermediate fabrics
also require physical flip-flop resources for any storage. Virtual registers are technically
not overhead because synthesis tools can directly implement virtual registers on
physical flip-flops in the FPGA. However, virtual configuration flip-flops and any pipelined
interconnect is overhead because the resulting physical flip-flops would not be used by a
circuit directly targeting the FPGA.
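To make this overhead accounting concrete, the per-track cost described above can be restated as a small model. This is a rough sketch of the text's numbers, not output from the fabric generator, and the function name is ours:

```python
import math

def track_overhead(m_bits, n_sources):
    """Virtual-track cost model from Section 3.1.2: an m-bit n:1 mux plus a
    ceil(log2(n))-bit virtual configuration register holding the mux select."""
    config_bits = math.ceil(math.log2(n_sources))
    return {"mux_width": m_bits, "mux_inputs": n_sources,
            "config_bits": config_bits}

# The example from Figure 3-2(b)/(c): a track with four sources (two switch
# boxes, two CUs) needs a 4:1 mux and a 2-bit configuration register.
print(track_overhead(16, 4))
```

Multiplying this per-track cost by the number of tracks in a fabric is what makes the interconnect dominate LUT usage.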
3.2 Optimized Interconnect
Based on the significant overhead caused by the virtual interconnect described
in the previous section, in this paper we focus on virtual interconnect optimizations
to reduce muxes, with the goal of retaining high routability. During an initial attempt
at optimizing virtual tracks, we observed that the RTL implementation shown in
Figure 3-2(c) contains some redundancy that could potentially be removed. Specifically,
a physical track would never have a common source and sink, which results in an
unnecessary input to the mux. For example, a physical FPGA would never route a signal
out of a switch box and back into the same switch box using the same track. Therefore,
[Figure 3-3 illustration: (a) the four-endpoint track of Figure 3-2 re-implemented with one mux per destination (CU north/south inputs, switch box east/west sinks); (b) the two-source case, in which each per-destination "mux" degenerates to a directional wire from a component output to a component sink.]

Figure 3-3. (a) An optimized virtual-track implementation to reduce routing redundancy, which eliminates muxes when (b) tracks have two sources.
we can eliminate the redundant routes and replace the n:1 mux with n separate (n-1):1
muxes, where each mux defines one of the possible track destinations. Figure 3-3(a)
shows an example for the previous track in Figure 3-2(c), where n=4. Despite eliminating
routing redundancy, such an approach does not save area because in most cases, n
separate (n-1):1 muxes require more LUTs than a single n:1 mux.
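The area argument can be checked with quick arithmetic on total mux data inputs per track. This is a back-of-the-envelope sketch, not a LUT-accurate model:

```python
def track_mux_inputs(n):
    """Total mux data inputs to implement one virtual track with n endpoints:
    a single n:1 mux (Figure 3-2(c)) versus n separate (n-1):1 muxes, one per
    destination (Figure 3-3(a))."""
    single = n                # one n:1 mux
    per_dest = n * (n - 1)    # n muxes of (n-1) inputs each
    return single, per_dest

# For n > 2 the per-destination style needs strictly more inputs, so it
# costs more LUTs. For n = 2, each "(n-1):1 mux" is a 1:1 mux -- a wire --
# so the per-destination style needs no LUTs at all.
for n in (2, 3, 4):
    print(n, track_mux_inputs(n))
```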
However, we have observed there is a special case where the track implementation
in Figure 3-3(a) can achieve reduced area. For any virtual track with exactly two
possible sources, this implementation simplifies into two directional wires as shown
in Figure 3-3(b). In other words, a 2-source virtual track requires two separate 1:1
muxes, but a 1:1 mux is just a wire. Therefore, by using only 2-source virtual tracks
throughout the entire intermediate fabric, we can potentially replace all mux logic and
wires in Figure 3-3(a) with two wires for each track. Such an optimization has significant
potential due to virtual tracks contributing to over 50% of area overhead. Furthermore,
this optimization saves a significant amount of wires per track, while simultaneously
improving routability by enabling routing in two directions. An additional advantage
[Figure 3-4 illustration: a grid of switch boxes with each CU's input and output connected directly to its adjacent switch boxes.]

Figure 3-4. Layout of intermediate fabric using optimized interconnect with CU I/O connected directly to adjacent switch boxes.
is that by reducing muxes, the fabric requires fewer configuration registers to store the
corresponding select values, which reduces flip-flop overhead while also improving
reconfiguration times. Although using 2-source virtual tracks reduces area, replacing the
3- and 4-source tracks used in previous fabrics is a significant challenge. In a traditional
island-style architecture, a track typically has 3-4 possible sources: 2 switch boxes and
1-2 CUs. If we eliminate the switch box connections, the track can only route between
adjacent resources, which significantly limits routability. Similarly, if we remove the CU
connections, then there is no way for routing to reach CUs.
To address this problem, we considered several significant modifications to
traditional fabrics. First, we started with 2-source tracks between adjacent switch boxes,
with each switch box as a possible source. However, that interconnect configuration
does not provide a mechanism for connecting CUs to the routing tracks. We could have
[Figure 3-5 illustration: (a) the previous planar switch box, where each of the north, east, south, and west outputs selects among the other three channels; (b) the presented switch box, which adds diagonal (NE, NW, SE, SW) channels for direct CU connections, with registered outputs.]

Figure 3-5. Switch box topologies for (a) previous intermediate fabric interconnect and (b) the presented interconnect with diagonal CU channels.
added connection boxes, but that would violate the 2-source restriction. Therefore,
we considered adding additional channels to each switch box with direct connections
to the CU I/O. The overall fabric layout for this optimized virtual interconnect is shown
in Figure 3-4. As illustrated, in this unconventional fabric, no virtual track has more
than 2 sources, which eliminates all muxes previously needed to implement tracks.
One challenge in designing this optimized interconnect is that although we eliminated
track muxes, we added additional muxes inside of the switch boxes to support the
additional CU channels. Unless the switch boxes add fewer muxes than we removed
from the tracks, this optimization does not reduce area. To ensure that the optimized
interconnect reduces LUT usage, we exploit the internal characteristics of the switch box
to handle the additional routing requirements with minimal logic. Previous intermediate
fabric switch boxes use a planar topology, where each output from the switch box uses
a 3:1 mux that selects an input from one of the three other channels, as shown in
Figure 3-5(a). For the new interconnect, these multiplexers could potentially require
four more inputs to handle routing of the four adjacent CUs, which would significantly
outweigh track savings. However, we can exploit the fact that increasing mux inputs
[Figure 3-6 plot: number of 4-input LUTs (log2 scale, 8 to 512) versus number of mux inputs (2 to 8) for 16-bit, 32-bit, and 64-bit data widths.]

Figure 3-6. Virtex 4 LX100 multiplexer LUT usage for varying mux input counts. The plateaus provide opportunities for switch boxes to add more connections without an area penalty.
does not always increase LUT requirements. As shown in Figure 3-6, FPGAs have
different area plateaus where additional mux inputs have the same LUT requirements as
fewer inputs (e.g., 3-4 inputs and 6-8 inputs). The optimized interconnect exploits this
characteristic by adding CU I/O connections to the muxes until reaching the largest input
size of a plateau, which maximizes routability without any increase in area. Interestingly,
the presented interconnect can be specialized for different physical FPGAs, which have
different mux plateaus due to varying LUT sizes.
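One way to reason about these plateaus is with a lookup table read off Figure 3-6. The per-bit LUT counts below are our illustrative reading of the Virtex 4 plot, not datasheet values:

```python
# Illustrative model of the Virtex 4 plateaus in Figure 3-6: 4-input LUTs
# per output bit of an n:1 mux. Note the plateaus at 3-4 inputs and at
# 6-8 inputs; treat these values as assumptions read from the plot.
LUTS_PER_BIT = {2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 4, 8: 4}

def mux_luts(n_inputs, data_width):
    """Estimated LUT cost of an n:1 mux over a data_width-bit channel."""
    return LUTS_PER_BIT[n_inputs] * data_width

# Widening a 3:1 switch-box mux to 4:1 is free under this model, while
# widening 5:1 to 6:1 is not -- matching Section 3.2's choice to stop at
# 4- and 5-input muxes.
print(mux_luts(3, 16), mux_luts(4, 16), mux_luts(5, 16), mux_luts(6, 16))
```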
Although the optimized interconnect switch boxes are not restricted to a specific
topology, we choose a planar-like topology for evaluation and target the mux plateaus for
4-input muxes. Therefore, the switch boxes increase 3-input muxes to 4 inputs wherever
possible. The switch boxes also use 5-input muxes, but do not increase the inputs to
6 or more, despite the plateau between 6 and 8 inputs. Increasing the mux inputs to 8
may improve routability with additional overhead, but we defer such analysis to future
work. An example topology is shown in Figure 3-5(b), where the switch box provides
a planar topology for the north, east, south, and west channels, which correspond to
virtual tracks. In this example, the CU channels (southeast, southwest, northwest,
northeast) connect to the other channels in customizable ways. Note that we are not
proposing a specific switch box topology for the optimized interconnect. Instead, like
any intermediate fabric, we expect the topology to change based on application and
routability requirements. For the applications we evaluated, using a highly directional
fabric was beneficial due to pipelined, feed-forward datapaths. However, the switch
box can easily be customized for other topologies. In the experiments, we use a fabric
generation tool that allows specification of the exact switch box topology in a fabric
description file.
3.3 Experiments
In this section, we compare intermediate fabrics using the presented virtual
interconnect with previous work [11][42]. Section 3.3.1 describes the experimental
setup. Section 3.3.5 compares area requirements, clock speedups, and routability
of both approaches for unspecialized, uniform fabrics. Section 3.3.6 presents similar
experiments for application-specialized fabrics.
3.3.1 Experimental Setup
This section describes the intermediate fabric tool flow used for the experiments
(Section 3.3.2), along with the routability measurements (Section 3.3.3), and the tools
used for evaluating the different interconnects (Section 3.3.4).
3.3.2 Tool Flow
To implement applications on the intermediate fabrics, we manually synthesize
circuits by creating technology-mapped netlists. We plan to convert open-source
synthesis tools to target intermediate fabrics, including OpenCL high-level synthesis,
but such a project is outside the scope of this paper. For place-and-route, we use the
algorithm previously described in [11] to ensure that the comparison between the new
and previous interconnect is not unfairly skewed by improved placement. In fact, the
place-and-route results for the new interconnect are likely pessimistic because we
did not modify the placer cost function for the new interconnect. The place-and-route
algorithm is a variation of VPR [6], and uses simulated annealing for placement with a
cost function that minimizes bounding box size. Routing uses the well-known PathFinder
[36] negotiated-congestion algorithm. Both the new and previous interconnect have
varying amounts of pipelining in switch boxes or on tracks. Instead of using pipelined
routing algorithms (e.g., [16]), both approaches use realignment registers in front of each
CU to balance the routing delays of all inputs. Because this pipelining strategy only
works for pipelined datapaths that can be retimed without affecting correctness, we limit
the evaluation to fabrics with coarse-grained resources commonly needed by datapaths
in signal processing. To configure the intermediate fabric for different applications, the
place-and-route tool outputs a configuration bitfile that we store in a block RAM on the
targeted FPGA. Each intermediate fabric includes a programmer which loads the bitfile
from the block RAM by shifting bits into virtual configuration registers that control the
CUs and virtual switch boxes.
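The placement cost function described above, a VPR-style bounding-box minimization, can be sketched as follows. The data structures are hypothetical stand-ins for the tool's internal representation:

```python
def bounding_box_cost(netlist, placement):
    """VPR-style placement cost: the sum of each net's bounding-box
    half-perimeter. `netlist` maps a net name to the cells it connects;
    `placement` maps a cell to its (row, col) position in the fabric grid.
    A sketch of the cost function described in Section 3.3.2, not the
    thesis tool itself."""
    cost = 0
    for cells in netlist.values():
        rows = [placement[c][0] for c in cells]
        cols = [placement[c][1] for c in cells]
        cost += (max(rows) - min(rows)) + (max(cols) - min(cols))
    return cost

# Hypothetical two-net example: simulated annealing would accept or reject
# random cell swaps based on the change in this cost.
netlist = {"n1": ["mul0", "add0"], "n2": ["add0", "out0"]}
placement = {"mul0": (0, 0), "add0": (1, 2), "out0": (3, 2)}
print(bounding_box_cost(netlist, placement))  # → 5
```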
3.3.3 Routability Metric
To fairly compare tradeoffs between interconnects, it is necessary to measure
routability. To perform these measurements for a given intermediate fabric, we
place-and-route a large number of randomly generated netlists of varying sizes, and
determine the routability score of the interconnect based on the percentage of netlists
that route successfully. Due to the fast place-and-route time for intermediate fabrics, we
were able to test 1,000 netlists for each fabric to obtain a high-precision metric. The
random netlist generator creates directed acyclic graph structures representative of
pipelined datapaths. Based on the CU composition of each individual fabric tested, the
generator creates a random number of datapath stages, each consisting of a random
number of technology-mapped cells, and creates random connections between each
stage. Each stage contains at minimum enough cells, and enough connections are
made between stages, such that each cell has at least one path to the next stage. This
method results in netlists containing one or more disjoint pipelines of one or more stages
each.
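A minimal sketch of this routability measurement, with illustrative names and a pluggable `route` callback standing in for the actual PathFinder-based router:

```python
import random

def random_pipeline_netlist(stage_count, min_cells, max_cells):
    """Sketch of the random netlist generator in Section 3.3.3: a DAG of
    pipeline stages, each with a random number of technology-mapped cells,
    where every cell gets at least one edge to the next stage."""
    stages = [[f"s{s}c{i}"
               for i in range(random.randint(min_cells, max_cells))]
              for s in range(stage_count)]
    edges = []
    for src_stage, dst_stage in zip(stages, stages[1:]):
        for cell in src_stage:
            edges.append((cell, random.choice(dst_stage)))
    return stages, edges

def routability(fabric, netlists, route):
    """Routability score: the fraction of netlists that `route` successfully
    places and routes on the fabric."""
    routed = sum(1 for n in netlists if route(fabric, n))
    return routed / len(netlists)
```

In the experiments, 1,000 such netlists per fabric keep the score's sampling noise small.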
3.3.4 Interconnect Evaluation
To evaluate different interconnects, we developed a tool capable of generating
VHDL for intermediate fabrics using the new interconnect. The tool takes as inputs a
fabric-description file that defines the parameters of the fabric, such as size, aspect ratio,
bit-width and the makeup of the fabric, including CU composition, and row and column
channel descriptions. Channel descriptions include number of tracks, direction of each
track, and switch box topology.
To obtain physical FPGA utilization and timing results, we synthesized the
intermediate fabric VHDL using Xilinx ISE 10.1, Synopsys Synplify Pro 2012, and
Altera Quartus II 10.1, depending on the targeted FPGA. To evaluate the effects of
FPGA variation on each virtual interconnect, we implemented intermediate fabrics on
Xilinx Virtex 4 LX100 and LX200, Xilinx Virtex 5 LX330, and Altera Stratix IV E530
FPGAs. The intermediate fabric HDL synthesized for each test case uses the fixed-logic
multipliers available on each physical device for all CUs (Xilinx DSP48s and Altera
18x18 Multipliers); therefore all device utilization represents the LUT and flip-flop
overhead of implementing the target application via an intermediate fabric rather than a
direct HDL implementation.
3.3.5 Interconnect Comparison for Uniform Intermediate Fabrics
In this section we compare area, routability, and maximum clock speed of
intermediate fabrics using the presented interconnect to intermediate fabrics using
interconnect previously presented in [11] and [42]. We evaluate each interconnect
using different fabric sizes, implemented on several different physical FPGAs. Although
intermediate fabrics can be specialized to an application, in this section we evaluate
fabrics independently of targeted applications by using a uniform fabric consisting of
16-bit DSP CUs with various dimensions (e.g., 5x5 = 5 rows and 5 columns of I/O and
Table 3-1. A comparison between the presented virtual interconnect (New) and previous uniform virtual interconnect (Prev).
CUs). Table 3-1 compares LUT and flip-flop utilization (as a % of total device resources),
routability of 1000 randomly generated netlists, and maximum clock speed for identical
intermediate fabrics using the new and previous interconnects. We implemented fabric
sizes between 3x3 and 12x8 on a Virtex 4 LX200, where an NxM fabric is composed
of one row of M inputs, N-2 rows of M CUs, and one row of M outputs. We evaluated
larger fabric sizes of 13x13 and 14x14 on a Virtex 5 LX330, and sizes 15x15 and 16x16
on a large Stratix IV E530. For fabrics using the previous interconnect, we used 3
16-bit tracks per channel with specialized connection boxes from [11], as previous work
indicated this configuration to be an effective tradeoff between routability and overhead.
For fabrics using the new interconnect, we used 2 16-bit tracks per row and 4 tracks
per column with the switch box topology described in Section 3.2 optimized for 4-input
muxes.
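For reference, the NxM convention above implies the following resource counts (a trivial helper; the function name is ours):

```python
def fabric_composition(n_rows, m_cols):
    """NxM fabric per Section 3.3.5: one row of M inputs, N-2 rows of
    M CUs each, and one row of M outputs."""
    return {"inputs": m_cols,
            "cus": (n_rows - 2) * m_cols,
            "outputs": m_cols}

# The largest fabric evaluated on the Virtex 4 LX200.
print(fabric_composition(12, 8))
```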
These results show the LUT and flip-flop utilizations of the new interconnect are
significantly less than the previous interconnect, with an average LUT savings of 54%
and flip-flop savings of 59% for the fabrics evaluated. Note that we were unable to
synthesize the old interconnect on the Stratix IV device. We tried three different versions
of Quartus, but the old interconnect would cause a crash during the retiming stage
of synthesis. For this reason, we exclude the Stratix IV results from the averages.
Additionally, the new interconnect showed significant maximum clock frequency speedup
for larger fabrics. When implemented on the Virtex 4, new interconnect clock speeds
decreased only 6.3% between fabrics of size 3x3 to 12x8, whereas the previous
interconnect suffered from a 34.7% decrease in clock speed over the same range.
Overall, the new interconnect averaged 167 MHz compared to 136 MHz. The new
interconnect did incur a routability penalty, with an average decrease of 16% compared
to the previous interconnect. While this overhead is a potential limitation of the new
interconnect, especially when applied to a general-purpose fabric, we believe this
overhead to be an acceptable tradeoff when compared to the significant area savings
provided by the new interconnect. Routability overhead can also be easily compensated
for when designing the CU composition of a fabric. Because the placer algorithm used
in these experiments is unchanged from that used for the old fabric, it is likely that an
appropriately customized placer cost function would significantly improve the routability
of the new interconnect. Similarly, fabrics using the new interconnect could account for
decreased routability by including many more routing resources while still saving area.
Routability decreased monotonically with increased fabric size due to the increased
difficulty of routing larger netlists. The one exception was the 3x3 fabric with the new
interconnect, which had lower routability than the larger fabrics. We identified the source
of this problem as limited connections between I/O and CUs for very small fabrics using
the new interconnect. Because we expect 3x3 to be an unusually small size for actual
usage, this overhead is not a significant limitation. These results also show decreased
LUT overhead savings of only 46% in fabrics implemented on the Virtex 5 device. This
smaller improvement is likely due to the different CLB configuration used by that device,
with slightly altered mux-area plateau characteristics, whereas the optimizations used by
the evaluated interconnect were optimized for 4-input muxes. Despite being optimized
for a different LUT configuration, the new interconnect still had significant savings.
Flip-flop usage on the Altera device was significantly higher than both Xilinx devices,
which resulted from the Xilinx FPGAs implementing the realignment registers as SRL16
primitives, in contrast to the Altera FPGA which used flip-flops. As future work, we
will investigate optimizations for Altera FPGAs. One additional advantage of reducing
muxes throughout the interconnect is the corresponding elimination of configuration
registers to store the select values. Fewer registers reduce flip-flop usage, as shown in Table 3-1, and also reduce configuration bitfile size, which correspondingly
reduces configuration times and block RAM overhead of the fabric. For the examples in
this section, the new interconnect improved configuration times by an average of 55%
compared to the previous interconnect.
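The bitfile-size effect can be estimated from the select-register widths alone. The mux mixes below are made-up illustrations, not measured fabric contents:

```python
import math

def fabric_config_bits(mux_input_counts):
    """Interconnect configuration bits for a fabric: each k:1 virtual mux
    needs a ceil(log2(k))-bit select register (Section 3.1.2). A sketch for
    reasoning about bitfile size, not the actual configuration format."""
    return sum(math.ceil(math.log2(k)) for k in mux_input_counts)

# Hypothetical mux mixes before and after eliminating track muxes. Because
# the programmer shifts the bitfile serially from block RAM, fewer bits
# also means proportionally faster reconfiguration.
old_bits = fabric_config_bits([4] * 300 + [3] * 200)
new_bits = fabric_config_bits([4] * 120 + [5] * 80)
print(old_bits, new_bits)
```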
3.3.6 Interconnect Comparison for Specialized Intermediate Fabrics
One advantage of intermediate fabrics is that a designer or tool can specialize the
architecture and interconnect for a given domain or even an individual application. In
this section, we compare intermediate fabrics using application-specialized interconnect
presented in [11] with the new interconnect. To enable a fair comparison, we evaluate
the same application circuits from [11] using the same specialized fabrics as previous
experiments. Specialization used in the previous experiments included varying fabric
sizes and non-uniform interconnects. For the new interconnect, we limit specialization
to fabric sizes, making the results pessimistic. For all specialized fabrics, we used the
smallest fabric and interconnect that could successfully route the target application
netlist. For these experiments, the physical FPGA is a Virtex 4 LX100, which we
chose to match the previous experiments. To perform the comparison, we used the
twelve applications from [11], seven of which were implemented using both 16-bit fixed
point arithmetic and 32-bit floating-point arithmetic, indicated with an FXD or FLT suffix
respectively. All track widths matched the CU widths. All circuits without a suffix used
16-bit fixed-point CUs. We briefly summarize the previous applications as follows.
Matrix multiply performs the kernel of a matrix multiplication, calculating the inner
product of two 8-element vectors using 7 adders and 8 multipliers. FIR implements
a 12-tap finite impulse response filter in transpose form with symmetric coefficients
using 11 adders and 12 multipliers. N-body, representing the kernel of an N-body
simulation, calculates the gravitational force exerted on a particle due to other particles
in two-dimensional space using 13 adders, multipliers, and a divider. Accum monitors a
stream, counting the number of times the value is less than a threshold. It is the smallest
netlist, consisting of 4 comparators and 3 adders. Normalize normalizes an input stream
using 8 multipliers and 8 adders. Bilinear performs bilinear interpolation on an image,
requiring 8 multipliers and 3 adders. Floyd-Steinberg performs image dithering using
6 adders and 4 multipliers. Thresholding performs automatic image thresholding using
8 comparators and 14 adders. Sobel uses a 3x3 convolution to perform Sobel edge
detection with 2 multipliers and 11 adders. Gaussian blur uses a 5x5 convolution to
perform noise reduction using 25 multipliers and 24 adders. Max filter performs a 3x3
sliding-window image filter with 8 comparators. Mean filter similarly calculates the
average of a sliding window, which we vary from 3x3 to 7x7, requiring a maximum of 48
adders and 1 multiplier. Table 3-2 compares the interconnects for each case study. The
first major column, Place-and-Route Time, compares place-and-route execution times
for an intermediate fabric with the previous interconnect (IF Prev), an intermediate fabric
with the new interconnect (IF New), and when synthesizing VHDL for each example
directly to the FPGA. The table also shows the resulting place-and-route speedup for the
new and previous interconnects. The results show comparable place-and-route times
for both the old and new interconnect. However, because the previous interconnect
already achieves a place-and-route speedup of 554x compared to an FPGA, the further
improvement by the new interconnect provided a 1350x place-and-route speedup.
The place-and-route speedup was larger for the floating-point examples due to longer
place-and-route times for the physical FPGA. Furthermore, these place-and-route
speedups are highly pessimistic because the specialized examples from [11] do not
include common board logic such as PCIe and memory controllers. Other studies
have shown that including these controllers with tight timing constraints can add up to
Table 3-2. A comparison between intermediate fabrics (IFs) with the presented virtual interconnect (IF New) and the previous application-specialized interconnect (IF Prev). The table's major columns are Place-and-Route Time, Area and Routability, and Clock Speed.
20 minutes to FPGA place-and-route time, but have no effect on intermediate fabric
place-and-route time [42].
The second major column in Table 3-2 reports area savings of the new interconnect
in terms of FPGA LUTs and flip-flops, along with the routability overhead incurred to
achieve these savings. On average, the new interconnect significantly reduced LUT
usage by 48% and flip-flop usage by 46%, despite the significant specialization by
the previous fabrics. On average, routability slightly improved by 8% with the new
interconnect. However, this average is skewed by three outliers, normalize, Gaussian,
and mean7x7, which had very low routability due to significant specialization in the
previous fabrics. Excluding these outliers, the new interconnect had a 2% routability
overhead. The smaller routability overhead compared to the previous section is due
to the specialized versions of the previous interconnect, which used just enough
routing resources to route the targeted application, and therefore lowered general
routability. The final column of Table 3-2 compares the maximum clock speed of
the specialized fabrics using both the new and old interconnect. For specialized
fabrics, these experiments show a negligible average impact on clock speed, with
both interconnects showing an average clock frequency of 186 MHz. However, there
was significant variation as high as 21% between specialized fabrics. It should be
noted that these results are contrary to the results for larger fabrics presented in the
previous section, which showed a clear trend of faster clock speeds for larger fabrics
using the new interconnect. The smaller clock improvement compared to the previous
section is due to the higher specialization of the previous interconnect, as opposed to
using a uniform interconnect.
CHAPTER 4
PSEUDO-CONSTANT LOGIC OPTIMIZATION
This chapter discusses a bottom-up approach to reducing intermediate
fabric interconnect area overhead. Specifically, we seek to reduce the area consumed
by each of the many multiplexers that compose the interconnect. The optimizations
presented below are complementary to the top-down architectural approach of the
previous chapter, which sought to limit the number of multiplexers used in the fabric.
FPGA logic optimization is a widely studied topic, with dozens of existing optimizations
that build upon decades of digital-design research [3][7][9][18][25][38][39]. A common
strategy involves iteratively propagating constants while performing logic minimization
(i.e., constant folding [19]). For example, Figure 4-1(a) shows a 4:1 multiplexer, which
a synthesis tool may map to three 4-input lookup tables (LUTs). In some situations,
as shown in Figure 4-1(b), a constant may propagate to the mux's select input, which
simplifies the logic to a wire.
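As a minimal illustration of this folding step, consider a sketch in Python rather than a synthesis tool's internal representation (the netlist encoding here is invented for clarity): a 4:1 mux node whose 2-bit select is fully constant collapses to a wire.

```python
# Illustrative sketch (not from a real synthesis tool): constant folding applied
# to a 4:1 mux node. Select bits are either a known constant (0/1) or None for
# unknown/runtime signals.

def fold_mux(inputs, select):
    """Return ('wire', input) if the 2-bit select is fully constant, else keep the mux."""
    s0, s1 = select
    if s0 is not None and s1 is not None:
        index = (s1 << 1) | s0          # select value, e.g. "01" -> 1
        return ("wire", inputs[index])  # mux simplifies to a wire
    return ("mux", inputs, select)      # cannot fold; keep the 3-LUT mux

# Select constant "01" (s1=0, s0=1): the mux collapses to input i1.
print(fold_mux(["i0", "i1", "i2", "i3"], (1, 0)))  # ('wire', 'i1')
```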
Unfortunately, constant-based optimizations have limited applicability. For example,
circuit designers often avoid constant inputs to enable support for as many use cases as
possible [46]. However, we have observed that circuits commonly include signals that
exhibit near-constant behavior, where the signal value rarely changes, which we define
as pseudo-constant. For example, many signal-processing applications initially set a
pseudo-constant convolution kernel, which remains the same for the duration of the
application. Alternatively, each frame of a low frame-rate video may also be considered
pseudo-constant. These pseudo-constant values are often inputs to common logic
components such as adders, multipliers, comparators, and muxes (e.g., [7][29]), which
could potentially benefit from constant folding to reduce area and/or increase replication.
We introduce pseudo-constant logic optimization, which is conceptually similar
to traditional constant folding, widely used in static logic optimization. However,
when a pseudo-constant changes values at runtime, the optimized logic becomes
Figure 4-1. A comparison of constant propagation for a multiplexer with (a) a non-constant select (requires 3 LUTs but supports all inputs), (b) a constant select (requires 0 LUTs but a statically known constant), and (c) pseudo-constant logic optimization for inputs that rarely change, using either LUT RAM (requires 1 LUT; must be reconfigured on input change, fast) or partial reconfiguration (requires 0 LUTs; must be reconfigured on input change, slow).
invalid. To prevent these invalidations from affecting correctness, we exploit FPGA
lookup-table (LUT) reconfigurability to dynamically modify the logic according to
the new pseudo-constant value. Although LUT reconfiguration causes performance
overhead, low-frequency invalidations often make this overhead insignificant. In general,
higher invalidation frequencies provide various tradeoffs between area savings and
performance overhead.
This chapter discusses the design process, implementation, and evaluation of
pseudo-constant logic optimization.
4.1 Pseudo-Constant Design Process
This section defines the four components of pseudo-constant logic optimization:
identification, technology mapping, bitfile creation, and invalidation detection.
4.1.1 Pseudo-Constant Identification
The first step of pseudo-constant logic optimization is the identification of potential
pseudo-constants. Our current approach uses designer-specified identification,
where designers use knowledge of an application's behavior to manually specify
pseudo-constant signals. Rather than requiring the actual value of a pseudo-constant,
a designer or synthesis tool need only know that a signal will be pseudo-constant (e.g.,
a convolution kernel). Although designers may often be aware of pseudo-constants,
there will be situations where potential pseudo-constants are not obvious. Synthesis
tools could potentially use a profiling-based heuristic that profiles the number of distinct
values of a given signal, along with the frequency that the value changes (i.e., the
invalidation frequency). Furthermore, we envision a hybrid approach where designers
specify the signals to profile. Previous work [28] has introduced such profiling for both
simulation and in-circuit behavior. We plan to investigate automatic pseudo-constant
identification as future work.
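Such a profiling heuristic could be sketched as follows (a hypothetical illustration, not the toolflow of [28]; the function name and trace are invented for clarity): given a recorded trace of a signal's values, report the number of distinct values and the invalidation frequency.

```python
# Hypothetical profiling heuristic: count distinct values of a signal trace and
# the invalidation frequency (fraction of samples at which the value changes).

def profile_signal(trace):
    distinct = len(set(trace))
    changes = sum(1 for prev, cur in zip(trace, trace[1:]) if cur != prev)
    invalidation_freq = changes / max(len(trace) - 1, 1)
    return distinct, invalidation_freq

# A convolution-kernel-like signal: set once, then held for the whole run.
trace = [0] + [7] * 999
distinct, freq = profile_signal(trace)
print(distinct, freq)  # 2 distinct values, ~0.1% invalidation frequency
```

A signal with few distinct values and a low invalidation frequency would be a strong pseudo-constant candidate.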
4.1.2 Pseudo-Constant Technology Mapping
One important difference between pseudo-constant and traditional logic optimization
is that the elaborated circuits may be identical, but may differ significantly after
technology mapping. For example, consider the 4:1 multiplexer from Figure 4-1, with a
constant or pseudo-constant select input. For this example, logic optimizations would
replace the multiplexer with a wire connected to the mux input that corresponds to
the select's constant/pseudo-constant value. However, depending on the available
FPGA primitives, technology mapping for the pseudo-constant logic may require more
than a wire because the resulting circuit must handle changes caused by invalidated
pseudo-constants. To deal with these invalidations, any logic that is optimized in the
previous step is marked as pseudo-constant logic, which technology mapping handles
differently from normal logic. Technology mapping for pseudo-constant logic is similar to
traditional technology mapping, but is restricted to FPGA primitives that support runtime
reconfiguration. Although there could potentially be numerous primitives, in this paper
we focus on common primitives in existing FPGA devices: LUT RAM and LUT shift
registers. Section 4.2 describes example mappings for Xilinx devices. Pseudo-constants
are also possible on Altera devices using MLABs, but evaluation of MLABs is outside
the scope of this paper. Previous work has focused on using partial reconfiguration for
similar goals [46], which we omit from this study due to the long reconfiguration times
compared to rewriting LUT contents. However, partial reconfiguration may represent a
Pareto-optimal tradeoff in terms of area savings and performance overhead. We plan to
investigate these tradeoffs as future work.
4.1.3 Pseudo-Constant Bitfile Creation
After technology mapping, the resulting circuit must create and/or provide a small,
corresponding bitfile that implements the logic for each pseudo-constant value. In the
case of LUT RAM or LUT shift registers, this bitfile is simply the truth table stored in
the LUT. For the mux example in Figure 4-1(c), the bitfile for the pseudo-constant mux
is 16 bits, due to the 4-input LUT (2^4 bits). The overhead of pseudo-constant bitfile
creation depends on the characteristics of a particular pseudo-constant, which may be
amenable to either offline or online creation. In this paper, we focus mainly on offline
creation, but discuss the tradeoffs and challenges of online creation to present all
possible use cases. Offline creation is possible when the designer or synthesis tool
is aware that a pseudo-constant only has several possible values. In this case, the
synthesis tool can pre-compute the bitfile for all possible values and store the bitfiles
in on-chip memory, which the circuit loads into the corresponding primitives during
a pseudo-constant invalidation. For example, for a 4:1 mux with a pseudo-constant
select, a synthesis tool could statically determine four separate bitfiles and store them
in a block RAM or other memory. Offline creation is not limited to functions with small
numbers of inputs. For example, an input to a 32-bit comparator may only have two
different possible values for a given application (e.g., a runtime-specified threshold in
an image-processing application), which would enable a synthesis tool to statically
create two separate bitfiles. Online bitfile creation is needed when a synthesis tool is
not aware of the different possible values of a pseudo-constant, or alternatively when
there are too many possible values, which would require a significant amount of on-chip
memory to store bitfiles. In general, online bitfile creation is more complicated and
requires a portion of the circuit, or a co-processor, to calculate truth tables for invalidated
pseudo-constant logic. In many situations, online creation is not practical because the
logic required for bitfile creation is larger than the savings from the pseudo-constant
logic. Note that pseudo-constant bitfiles also create memory overhead. We therefore
expect pseudo-constant optimization to be appropriate where block RAM is not the main
resource bottleneck of an application.
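For the mux example above, offline creation amounts to tabulating one truth table per possible select value. The following sketch assumes a simplified bit ordering; a real Xilinx LUT INIT value uses a device-specific encoding.

```python
# A sketch of offline bitfile creation for a 4:1 mux with a pseudo-constant
# select: each possible select value yields a 16-entry truth table over the
# four data inputs i0..i3 (bit ordering here is an assumption for illustration).

def mux_bitfile(sel):
    """16-bit truth table whose output equals data input number `sel`."""
    bits = 0
    for addr in range(16):          # addr bit k holds the value of input ik
        out = (addr >> sel) & 1
        bits |= out << addr
    return bits

# Precompute one bitfile per possible select value and store them in memory
# (on the FPGA, this table would live in a block RAM).
bitfiles = {sel: mux_bitfile(sel) for sel in range(4)}
print([f"{b:04x}" for b in bitfiles.values()])  # ['aaaa', 'cccc', 'f0f0', 'ff00']
```

On an invalidation, the circuit would simply shift the 16-bit entry for the new select value into the LUT.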
4.1.4 Pseudo-Constant Invalidation Detection
Pseudo-constant circuits must identify when a pseudo-constant changes values,
which we refer to as invalidation detection. After detecting an invalidation, the circuit
loads a new bitfile into the corresponding resources. In this paper, we use application-specified
detection, where the designer explicitly specifies when a given pseudo-constant
changes. One disadvantage is that this approach is error prone and requires knowledge
of pseudo-constant invalidations. However, for many applications, invalidations are
obvious. For example, for designer-specified pseudo-constants, the designer is already
aware of pseudo-constants, and is likely aware of when the application changes a
pseudo-constant (e.g., a new image). As future work, we envision the possibility of
runtime detection, which doesn't require designer knowledge, but is often impractical due
to overhead. In the general case, runtime detection requires a comparator, which may
outweigh savings except for large regions of pseudo-constant logic (e.g., large adder
trees).
Figure 4-2. Functional architecture of a Xilinx Virtex 5 LUT. Each LUT can be configured as a 64x1 dual-ported RAM, a single variable-length shift register up to 32 bits long, or two independent variable-length shift registers up to 16 bits long each.
4.2 Technology Mapping
In this section, we discuss how different pseudo-constant logic can be technology
mapped onto FPGA LUT primitives. In Section 4.2.1, we present pseudo-constant
primitives for the Xilinx Virtex 5. In Section 4.2.2, we identify architectural bottlenecks
and present extensions that would enable Virtex 5 to better support pseudo-constants.
4.2.1 Pseudo-Constant Primitives for Xilinx Virtex 5
General-purpose logic resources in Xilinx Virtex 5 devices are composed of
columns of configurable logic blocks (CLBs). Each CLB is composed of two SLICEs,
each of which contains four LUTs. While devices are composed of equal numbers
of two different SLICE types, SLICEM and SLICEL, only SLICEMs have dynamically
reconfigurable LUT primitives; therefore, in this work we consider only SLICEMs.
Figure 4-2 shows the simplified functional architecture of the Virtex 5's six-input,
two-output LUT. The LUT is logically composed of two five-input, one-output (32x1)
random-access memory structures, addressed by the LUT's lower five inputs (A1:A5).
A mux uses the sixth input (A6) to select which of the two 32x1 outputs drives the LUT's
primary O6 output, while one of them directly drives a secondary output, O5. Output O6 may
be any combinational function of all six inputs, while O5 is a subset function of only five
inputs. Each Virtex 5 LUT can be configured as either a 64x1 or 32x2 dual-ported RAM,
with one synchronous write and one asynchronous read port. An additional six inputs
(WA[1:6]) specify the write address. Each LUT can also be configured as one 32-bit
shift register, or two 16-bit shift registers, each with addressable outputs that can select
any bit of the shift register. Xilinx refers to these shift register primitives as SRL32 and
SRL16 respectively. Figure 4-3 shows a simplified view of a SLICEM from the Virtex 5
user guide [2]. Each SLICEM contains four LUTs, referred to as A, B, C, and D. Each
LUT has six dedicated logic or read address inputs, as well as two data inputs to drive
LUT RAM and SRL inputs. LUT D's read-address inputs (D1:D6) are also used to drive
the six write address inputs for all four LUTs. As discussed later, this addressing method
is a limitation for pseudo-constant logic because LUT D cannot be efficiently used for
outputs.
Paired with each LUT is dedicated carry-chain logic and a flip-flop. Dedicated
outputs carry each LUT O6 output and flip-flop output to the routing fabric. Muxes select
from the LUT O5 output, carry-chain output, and shift-out as well to drive the flip-flop
input and a third output. The shift-out port of each LUT is connected to the shift-in port
of the next LUT (i.e., Shift Out D ? Shift In C) to create longer shift register chains, up
to 128 bits per SLICEM. Two dedicated muxes select between outputs from LUT A and
B, and LUT C and D, and a third mux selects between those two muxes. This structure
enables eight-input muxes using only two LUTs, and 16-input muxes using four LUTs.
4.2.1.1 Distributed RAM
To implement the LUT RAM pseudo-constant primitive, we use Xilinx Distributed
RAM. Each Xilinx LUT allows read and write access to the 64 SRAM bits in either
Figure 4-3. All four LUTs (A-D) of a single Xilinx Virtex 5 SLICEM configured as distributed RAM. LUTs A, B, and C can be configured as three reconfigurable 6:1 or 5:2 functions, while LUT D is consumed by the write-address inputs.
64x1-bit or 32x2-bit dimensions. Multiple LUTs per slice can be grouped together to
create wider or deeper memories. Because the write addresses for the four LUTs
are driven by LUT D's six logic and read inputs, significant limitations are placed
on the efficiency of LUT RAM structures when using the Virtex 5. For example, a
dual-ported 64x1 RAM requires two LUTs, resulting in a 50% area penalty. To achieve
maximum area efficiency, a LUT RAM primitive using Virtex 5 distributed RAM should
ideally use all four LUTs in a single SLICEM. Inputs D[1:6] drive the common write
address and are used to configure LUTs A, B, and C, which can then be used as three
independent LUTs, while LUT D's inputs are consumed by serving as the write address
for LUTs A, B, and C. Figure 4-3 shows four LUTs connected in this fashion. Using LUT
RAM, each SLICEM yields either three 6-input, 1-output functions, or three 5-input,
2-output functions. Because only one flip-flop is available to each LUT, in the case of
2-output functions, only one output can use a flip-flop, which can be a limitation for
pipelined logic. If inputs D[1:6] can be driven by both logic during normal operation
and configuration hardware during reconfiguration, then LUT D may also be used
for pseudo-constant based logic, eliminating the penalty. In this case, four 6:1 or 5:2
functions could be realized per SLICEM.
4.2.1.2 Shift Register
LUT shift-register primitives can be implemented using Xilinx SRL primitives.
When configuring LUTs as shift registers, configuration bits for many LUTs can be
shifted serially in a single configuration chain. Using the SRL32, a single LUT can be
configured as a five-input, one-output function. Configured as two SRL16s, each LUT
can be configured as a four-input, two-output function. Unlike SRL32, each SRL16 must
be driven by an independent configuration input; multiple SRL16 primitives cannot be
chained together in a single, long configuration chain.
4.2.2 Architectural Extensions
The pseudo-constant primitives for the Virtex 5, described above, highlight many
of the challenges of implementing pseudo-constant logic optimization on modern
FPGAs. In this section, we discuss the FPGA architectural characteristics that most
limit the effectiveness of pseudo-constant logic optimizations, particularly those of the
Xilinx Virtex 5 CLB architecture, and suggest modifications to improve the efficiency of
pseudo-constants. Pseudo-constant implementations can be viewed as a traditional
input/output-bound problem. The Virtex 5 implementations described above show
that the number of inputs and outputs to an FPGAs LUTs are a key limitation of
pseudo-constant logic packing and place an upper bound on the achievable area
reduction. Additionally, the number of inputs required to produce a given output value,
and the number of inputs shared among multiple outputs, enforce a similar limitation.
For example, in the design of an adder circuit described in the next section, the key
design limitation was the number of outputs from a LUT. While groups of four to six
input pins, producing three to five sum outputs and a carry, could drive a single LUT, at
most two outputs per LUT could be generated. Additionally, the availability of only one
set of fast carry logic and one flip-flop per LUT limits the achievable maximum clock speed
when using two outputs per LUT. Furthermore, with LUT RAM primitives, one LUT per
SLICE is consumed solely by the use of its address pins for the RAM write address,
and cannot be used for logic. As a result of these challenges, we suggest that future
FPGA architectures could be augmented to improve the viability of pseudo-constant
optimizations. For example, we have observed that modifications to improve efficiency
of wider-output functions, such as those found in many arithmetic operations, could
greatly benefit pseudo-constant optimizations. Particularly, more outputs per LUT
and a fast carry logic and flip-flop pair for each LUT output, could greatly improve the
efficiency of wide- or multi-output functions. By adding an extra set of address pins to
the SLICE to serve as the common write address input, the 25 percent loss of functional
density in LUT RAM based designs can be averted. Figure 4-4 shows a possible SLICE
architecture for a device including these modifications. Specifically, this device's SLICE
architecture is composed of four six-input, two-output LUTs identical to those of a
Virtex 5. Additionally, carry-logic and flip-flop stages identical to those available to the
O6 output of Virtex 5 LUTs are added to both outputs of each LUT. An additional fifth
set of six input pins is added to serve as a common write-address input for LUT RAM
primitives.
4.3 Experiments
To evaluate pseudo-constant logic optimization, we manually technology mapped
common logic functions onto pseudo-constant primitives for Xilinx Virtex 5 FPGAs.
Because Virtex 5, Virtex 6, and Virtex 7 devices all employ an identical CLB architecture,
the results also apply to those devices. To determine benefits, we also synthesize each
circuit without the proposed optimization to a Xilinx Virtex 5 LX50 FPGA using Xilinx ISE
14.2. For each example, we include results for only the most efficient of either a LUT
RAM or shift-register-based design. In addition to the Virtex 5, we evaluate the same
Figure 4-4. A modified Virtex 5 slice including enhancements to improve efficiency of pseudo-constant optimized logic. An extra carry-logic and flip-flop output stage is added to the secondary O5 output of each LUT, and a fifth set of address pins is added for dedicated write-address use.
circuits on a theoretical device incorporating the modifications proposed in Section 4.2.2.
This theoretical device is composed of CLBs using the modified Virtex 5 architecture
shown in Figure 4-4, for which we assume Virtex 5 timing and switching characteristics
[1]. Note that the timing results are optimistic because the theoretical architecture
may have longer delays. Similarly, the theoretical device may have general-purpose
area tradeoffs, which are outside the scope of this study. We also evaluated support
logic to reconfigure pseudo-constant circuits. We implemented a simple programmer
to iteratively shift each bit of the pseudo-constant bitfile into the configuration bits of
each LUT. This circuit is composed of a counter and a BRAM to store the bitfile, and
consumes as few as 10 LUTs. Therefore, pseudo-constant optimizations are only
beneficial when they save more than 10 LUTs. This circuit can be used to program each
LUT sequentially, or replicated to program two or more in parallel.
In this section, we evaluate logic that is commonly replicated in large numbers by
many FPGA applications. These replicated circuits represent an appropriate usage
case for pseudo-constant circuits as many copies can share a small number of support
circuits for reconfiguration and invalidation. This sharing enables the overhead of the
pseudo-constant support logic to be amortized over a large number of optimized circuits.
The evaluated circuits include an adder, a comparator, and a multiplexer.
4.3.0.1 32-bit Full Adder
When synthesized into FPGA LUTs, adder circuits are output-bound. Because
addition operations are wide-output functions, the key challenge in minimizing the
number of LUTs is in driving all N outputs. Synthesis in Xilinx ISE for a Virtex 5 adder
uses the dedicated fast carry logic to create ripple-carry adders. Each LUT adds the ith
bit of each input A and B, generating a sum and carry output. These outputs drive the
carry logic, which combines these signals with Ci-1 to generate Si and Ci. If instead
the add operation had one pseudo-constant input and one normal (i.e., non-constant)
input, the pseudo-constant value can be folded into the function implemented by each
LUT to reduce the LUT utilization. In this case, the only input to each LUT would be the
ith bit of the non-constant input. Even though each LUT has several free inputs, because
both sum and carry-out signals must be generated for each bit, both LUT outputs are
consumed, and no LUTs can be eliminated from the circuit. Suppose instead three bits
of the non-constant input, [Ai, ..., Ai-2], along with a carry input Ci-3, were connected to
two LUTs. The four available outputs from this structure can then implement outputs [Si,
..., Si-2] and Ci. Because each previous bit's inputs are available to the function generator
for each output bit, the internal carry values can be calculated without consuming LUT
outputs. This structure implements a 3-bit full adder using only two LUTs, rather than
three, providing a 33% area savings.
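The folding described above can be sketched by enumerating the truth tables for one two-LUT group (the bit ordering and table encoding here are assumptions for illustration, not the actual LUT INIT format): the inputs are the three non-constant bits A[2:0] plus a carry-in, and the outputs are S[2:0] and a carry-out, with the pseudo-constant operand B baked into every table entry.

```python
# Sketch of folding a pseudo-constant 3-bit operand B into the four truth
# tables produced by a two-LUT group (assumed bit ordering, for illustration).

def three_bit_adder_tables(b_const):
    tables = {"s0": 0, "s1": 0, "s2": 0, "cout": 0}
    for addr in range(16):                  # addr = {cin, A[2:0]}
        a, cin = addr & 0b111, (addr >> 3) & 1
        total = a + b_const + cin           # B is folded in as a constant
        for i in range(3):
            tables[f"s{i}"] |= ((total >> i) & 1) << addr
        tables["cout"] |= ((total >> 3) & 1) << addr
    return tables

# With B = 5, check one entry: A = 6, cin = 1 -> 6 + 5 + 1 = 12 = 0b1100.
t = three_bit_adder_tables(5)
addr = (1 << 3) | 6
assert [(t[k] >> addr) & 1 for k in ("s0", "s1", "s2", "cout")] == [0, 0, 1, 1]
```

On an invalidation of B, only these four small tables need to be regenerated and shifted into the LUTs.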
Using the SRL16-based four-input, two-output pseudo-constant
LUT primitive described above, many such pseudo-constant 3-bit full adders can be
chained together to implement wider pseudo-constant adders. Figure 4-5 shows a
32-bit adder designed using these structures. When synthesized for a Virtex 5 device
Figure 4-5. An SRL16-based pseudo-constant 32-bit full adder design.
using Xilinx XST, a normal 32-bit adder consumes 32 LUTs. When synthesized using
the pseudo-constant based design, a 32-bit adder consumes only 22 LUTs, an area
savings of 31%. Figure 4-6 shows how adder LUT count grows with input width for
both the traditional and pseudo-constant adders. The darker line shows how traditional
adder LUT count grows linearly, equal to the bit width. The lighter line shows how
the pseudo-constant adder LUT count grows slower and in a step-wise fashion. The
step-wise behavior and LUT savings are due to the fact that every other LUT generates
two bits of the output. The figure also shows that the LUT savings increase as adder width
increases.
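The growth trends in Figure 4-6 can be summarized with simple ceiling formulas (a sketch inferred from the structures described above: one LUT per bit for the traditional ripple-carry adder, and two LUTs per three output bits for the pseudo-constant design).

```python
# LUT-count trends as formulas: a traditional Virtex 5 ripple-carry adder uses
# one LUT per bit, while the pseudo-constant design produces three sum bits per
# pair of LUTs, which yields the step-wise growth in Figure 4-6.
from math import ceil

def traditional_adder_luts(width):
    return width

def pseudo_constant_adder_luts(width):
    return ceil(2 * width / 3)      # two LUTs per three output bits

for w in (8, 16, 32):
    print(w, traditional_adder_luts(w), pseudo_constant_adder_luts(w))
# The 32-bit case gives 32 vs. 22 LUTs, the 31% savings reported above.
```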
Because the Virtex 5 CLB's fast carry logic is accessible by only one output from
each LUT, the optimized design cannot benefit from the fast carry logic. Despite a
shorter overall combinational path, 11 logic stages rather than 32, the longer path
between neighboring LUTs increases the circuit's combinational delay by over 4x, from 2.515
Figure 4-6. A graph of LUT counts for pseudo-constant and traditionally synthesized adders on a Xilinx Virtex 5 as adder bit width increases.
ns for traditional logic to 10.377 ns using the pseudo-constant design. Additionally,
because only one flip-flop is available per LUT, only one output from each LUT can
directly drive a pipeline register without consuming an additional route-through LUT.
When the pseudo-constant design is instead mapped onto the modified architecture
from Section 4.2.2, the 31% area savings are retained, while at the same time each
output bit can take advantage of the fast carry logic and flip-flop output stage. Thus, a
32-bit ripple carry adder can be mapped to the modified architecture using 22 LUTs with
a combinational delay of 1.343 ns. This delay for the pseudo-constant-optimized adder
is 47% faster than a traditionally synthesized adder.
4.3.0.2 Multiplexer
A pseudo-constant multiplexer can be designed similarly to the adder described
in the previous section. Using traditional synthesis methods, a four-input mux requires
one LUT on a Virtex 5. Multiple four-input muxes can be combined using dedicated
SLICE mux hardware to create up to one 16-input mux per SLICE. If the select input
to a mux were found to be pseudo-constant, using the SRL32 five-input, one-output
Figure 4-7. A graph of LUT counts for pseudo-constant and traditionally synthesized multiplexers on a Xilinx Virtex 5 as the number of inputs increases.
LUT primitive, a five-input mux consumes only one LUT, and a 20-input mux can be
created in each SLICE. This design yields a 25 percent increase in functional density
over traditional synthesis. Additionally, a four-input, two-output mux can be designed
using the SRL16 four-input, two-output LUT primitive consuming only one LUT, yielding
up to 50 percent LUT savings. By taking advantage of the LUT RAM-based primitive in
the modified architectures, a six-input, one-output mux can be created using just one
LUT, with up to a 24-input mux per SLICE. This design yields a 50 percent increase
in functional density over traditional synthesis, and a 25 percent increase over the
pseudo-constant design on the Virtex 5. Figure 4-7 shows the LUT count needed for
muxes implemented with each design as the number of inputs grows, up to 32 inputs.
The figure shows a step-wise trend due to LUT counts growing in different multiples.
The unoptimized mux increases in multiples of four inputs per LUT. The pseudo-constant
mux on the Virtex 5 increases in multiples of five inputs per LUT. The mux on the modified
architectures increases in multiples of 6 inputs per LUT. As muxes grow larger, the LUT
savings achieved by pseudo-constant designs increase. There is no difference in timing
performance among the three designs.
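These input-per-LUT multiples can be summarized as ceiling formulas (a sketch; the dedicated SLICE muxes that combine LUT outputs are treated as free, matching the per-SLICE capacities stated above).

```python
# Mux LUT counts as a function of input count for the three designs discussed
# above: 4 inputs per LUT (unoptimized), 5 (pseudo-constant SRL32 on Virtex 5),
# and 6 (pseudo-constant LUT RAM on the modified architecture).
from math import ceil

def mux_luts(num_inputs, inputs_per_lut):
    return ceil(num_inputs / inputs_per_lut)

for n in (16, 20, 24, 32):
    print(n,
          mux_luts(n, 4),   # unoptimized Virtex 5 mux
          mux_luts(n, 5),   # pseudo-constant SRL32 design
          mux_luts(n, 6))   # pseudo-constant on the modified architecture
```

For example, a 20-input mux needs four LUTs in the pseudo-constant SRL32 design, the full capacity of one SLICE.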
Figure 4-8. An SRL32-based pseudo-constant 32-bit comparator design.
4.3.1 32-bit Comparator
A pseudo-constant comparator can be designed similarly to the adder described
above. Suppose a circuit must compare two 32-bit numbers, A and B, for equivalence.
When synthesized to the Virtex 5 architecture, this circuit requires 11 LUTs, with a
propagation delay of 4.658 ns. If input B was found to be pseudo-constant, its value
can be folded into the function implemented by the circuit's LUTs. Figure 4-8 shows
a comparator design using the SRL32-based five-input, one-output LUT primitive
described above. The inputs to each LUT are comprised of a group of four consecutive
bits of the variable input, along with a carry-out from the previous group. The outputs
from these groups are cascaded together to create a 32-bit wide comparator using only
8 LUTs instead of 11, an area savings of 27%. The propagation delay increases to 6.556
ns. By taking advantage of the modified architecture, using the six-input, one-output LUT
RAM primitive, the pseudo-constant comparator circuit can be optimized further. The
size of each group of inputs is increased from four to five. Thus a 32-bit pseudo-constant
comparator can be synthesized on the modified architecture using only 6 LUTs, yielding
a 20 percent area decrease and a 20 percent shorter combinational delay compared
to the pseudo-constant design synthesized on the Virtex 5. The resulting delay is only
2.7% slower than the traditional comparator.
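The cascaded comparison can be modeled behaviorally as follows (a software sketch of the logic, not the thesis's HDL; the function name and the `group` parameter are illustrative assumptions):

```python
def pseudo_constant_equal(a: int, b_const: int,
                          width: int = 32, group: int = 4) -> bool:
    """Model of the cascaded pseudo-constant equality comparator: each
    'LUT' checks one group of bits of the variable input A against the
    corresponding bits of the folded constant B, ANDing in the match
    result (carry) cascaded from the previous group."""
    mask = (1 << group) - 1
    carry = True
    for shift in range(0, width, group):
        carry = carry and ((a >> shift) & mask) == ((b_const >> shift) & mask)
    return carry
```

Widening `group` from four to five bits (plus the cascaded carry) models the move from the Virtex 5 five-input primitive to the modified architecture's six-input LUT RAM primitive.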
4.3.2 Functional Density
In [46], Wirthlin et al. present a functional density metric, D, defined as the inverse
of the product of a circuit's area, A, and operating time, T, as shown in Equation 4–1:
D = 1 / (A · T)    (4–1)
This metric is used to quantify the benefits of circuit specialization and enable
comparison of area and performance tradeoffs. Additionally, [46] presents a specialized
form of Equation 4–1 for use with run-time reconfigurable circuits such as pseudo-constant
optimized circuits. By adding reconfiguration time, tconfig, divided by operations per
reconfiguration, n, to the operating time term, the metric accounts for the performance
effects of reconfiguration operations at a given invalidation frequency. Equation 4–2
shows this modified metric.
D = 1 / (A (texec + tconfig / n))    (4–2)
Figure 4-9 plots the functional density, as defined by Equation 4–2, for each of the
three adder circuits. The figure shows the operations between invalidations (i.e., the
inverse of invalidation frequency) decreasing logarithmically. This figure shows that
while the combinational delay overhead on the Virtex 5 architecture prevents the
pseudo-constant circuit from matching the functional density of the traditional adder
circuit, on the modified architecture the pseudo-constant circuit surpasses the functional
density of the traditional adder after only 19 operations between reconfigurations.
Additionally, reconfiguration overhead per operation reaches nearly zero after only
2^14 operations, a small figure considering FPGA clock frequencies in the hundreds of
megahertz. For infrequent invalidations, the functional density of the pseudo-constant
adder on the modified architecture approaches 2.7x that of the traditional adder. In any pseudo-constant design
using LUT RAM or shift-register LUTs, reconfiguration can load the pseudo-constant
bitfile into each LUT either in serial or in parallel. In serial reconfiguration, the bits are
written into one LUT at a time, one bit per cycle. This method yields
the longest reconfiguration time and the largest performance penalty. Alternatively,
during parallel reconfiguration, each bit must still be written into each LUT one bit
at a time, but all LUTs can be written on the same cycle. Parallel reconfiguration
decreases reconfiguration time by a factor of N, where N is the total number of LUTs in
the pseudo-constant circuit. Because parallel reconfiguration requires proportionally
more reconfiguration resources, a designer or synthesis tool must consider an
area-performance tradeoff between parallel and serial reconfiguration. Additionally,
the degree of parallelism can be adjusted to find an appropriate Pareto-optimal design
point for each design.
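The serial/parallel tradeoff can be expressed as a small cost model (the function and its parameters are assumptions for illustration, not the thesis's tooling):

```python
import math

def reconfig_cycles(num_luts: int, bits_per_lut: int, parallel_luts: int) -> int:
    """Cycles to reload all pseudo-constant LUTs: one bit enters a LUT per
    cycle, and `parallel_luts` LUTs are written concurrently.
    parallel_luts=1 is fully serial; parallel_luts=num_luts is fully parallel."""
    return math.ceil(num_luts / parallel_luts) * bits_per_lut
```

For example, reloading eight 32-bit SRL-based LUTs takes 256 cycles serially but only 32 cycles fully in parallel, at the cost of roughly eight parallel write paths; intermediate values of `parallel_luts` trace the area-performance curve between those extremes.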
Figure 4-10 compares the functional density of each pseudo-constant 32-input
mux to traditional muxes using either fully-parallel or fully-serial reconfiguration. Longer
dashed lines show parallel reconfiguration, and dotted lines show serial reconfiguration.
Lighter lines show pseudo-constant muxes implemented on the standard Virtex 5
architecture. Darker lines show pseudo-constant designs implemented on the modified
architecture. Functional density for intermediate degrees of parallelism can be inferred
between the trend lines for each architecture. All densities are shown as a ratio to
functional density of a traditional Virtex 5 mux, shown in solid black.
The results show that pseudo-constant muxes approach a functional density
of 1.25x on the Virtex 5 architecture, and 1.5x on the modified architecture, when
compared to traditional synthesis. Additionally, the graph shows that the break-even
point, at which functional density of the pseudo-constant optimized and traditional
circuits are equal, is approximately 128 operations per invalidation using fully parallel
reconfiguration, and fewer than 900 operations using fully serial reconfiguration.
Figure 4-9. Functional density of a pseudo-constant adder compared to a traditional adder as the invalidation frequency increases. Results are shown for both the Virtex 5 and modified architectures.
Figure 4-10. Functional density of each pseudo-constant mux design compared to a traditional mux as the invalidation frequency increases. Functional density for each design is shown for both fully-parallel and fully-serial reconfiguration.
CHAPTER 5
CONCLUSIONS
Previous work introduced intermediate fabrics to address FPGA problems related
to lengthy place-and-route times and a lack of application portability. Although previous
intermediate fabric approaches achieve both application portability and significant
place-and-route speedup, the area overhead of those approaches prohibits important
use cases. To address this problem, we identified the virtual interconnect as the main
source of the overhead, and followed two complementary approaches to reduce
overhead.
After identifying multiplexers as the primary component of the interconnect, we first
performed design-space exploration to identify unconventional alternatives that could
achieve effective Pareto-optimal tradeoffs between overhead and routability. Based on
this analysis, we introduced an optimized virtual interconnect architecture that reduces
area requirements by approximately 50% and improves clock frequencies by 24%, with
a modest 16% reduction in routability.
Additionally, we sought to reduce the size of overhead due to each individual
multiplexer through pseudo-constant logic optimization. We showed that pseudo-constant
optimizations can increase functional density of common logic structures such as
multiplexers up to 1.25x. While these optimizations can apply to many other functional
elements, such as adders and comparators, the experiments also show the difficulty of
implementing pseudo-constant designs on modern FPGAs. In particular, restrictions
on dynamic reconfigurability and narrow-output functional units limit the effectiveness
of pseudo-constant optimizations. If future FPGA designs address these concerns,
pseudo-constant optimizations could be a viable method of increasing functional density
in FPGA designs, with improvements as high as 2.7x.
While these optimizations enable designers to employ intermediate fabrics in a
wider range of area-constrained applications, there is still opportunity for continued
improvement. Future work must address and limit the routability and flexibility penalty of
the optimized interconnect presented, as well as both the manual and automated design
challenges of integrating pseudo-constant logic optimizations. Even with a 50-75%
reduction in LUT utilization, intermediate fabrics will still have prohibitive overhead
for use cases where an FPGA is close to being fully utilized. Fortunately, the trends
towards multi-million-LUT FPGAs will lessen this problem over time. In addition, we plan
to investigate virtual interconnect that directly targets the physical FPGA interconnect
without using muxes. Such an approach could map virtual switch boxes directly onto
physical switch boxes, potentially eliminating much of the remaining overhead. However,
such an approach requires knowledge of proprietary routing architectures, and is
therefore deferred to future work.
REFERENCES
[1] Xilinx Virtex-5 FPGA Data Sheet: DC and Switching Characteristics, 2010.
[3] Ashenhurst, R. L. “The decomposition of switching functions.” Proc. Internatl. Symp. Theory of Switching, Annals Computation Lab. vol. 29. Cambridge, Mass: Harvard University, 1957, 74–116.
[4] Athanas, P., Bowen, J., Dunham, T., Patterson, C., Rice, J., Shelburne, M., Suris, J., Bucciero, M., and Graf, J. “Wires on Demand: Run-Time Communication Synthesis for Reconfigurable Computing.” FPL ’07: International Conference on Field Programmable Logic and Applications. 2007, 513–516.
[5] Becker, J., Pionteck, T., Habermann, C., and Glesner, M. “Design and implementation of a coarse-grained dynamically reconfigurable hardware architecture.” VLSI ’01: Proceedings of IEEE Computer Society Workshop on VLSI. 2001, 41–46.
[6] Betz, Vaughn and Rose, Jonathan. “VPR: A new packing, placement and routing tool for FPGA research.” FPL ’97: Proceedings of the 7th International Workshop on Field-Programmable Logic and Applications. London, UK: Springer-Verlag, 1997, 213–222.
[7] Brant, A. and Lemieux, G.G.F. “ZUMA: An Open FPGA Overlay Architecture.” Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on. 2012, 93–96.
[8] Callahan, Timothy J., Chong, Philip, DeHon, Andre, and Wawrzynek, John. “Fast module mapping and placement for datapaths in FPGAs.” FPGA ’98: Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field programmable gate arrays. New York, NY, USA: ACM, 1998, 123–132.
[9] Chen, Chau-Shen, Tsay, Yu-Wen, Hwang, TingTing, Wu, A.C.H., and Lin, Youn-Long. “Combining technology mapping and placement for delay-minimization in FPGA designs.” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 14 (1995).9: 1076–1084.
[10] Compton, Katherine and Hauck, Scott. “Totem: Custom Reconfigurable Array Generation.” FCCM ’01: Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. Washington, DC, USA: IEEE Computer Society, 2001, 111–119.
[11] Coole, James and Stitt, Greg. “Intermediate Fabrics: Virtual Architectures for Circuit Portability and Fast Placement and Routing.” CODES/ISSS ’10: Proceedings of the IEEE/ACM/IFIP international conference on Hardware/Software codesign and system synthesis. 2010, 13–22.
[12] Cox, C.E. and Blanz, W.E. “GANGLION-a fast field-programmable gate array implementation of a connectionist classifier.” Solid-State Circuits, IEEE Journal of 27 (1992).3: 288–299.
[13] Dehon, Andre Maurice. Reconfigurable architectures for general-purpose computing. Ph.D. thesis, 1996. AAI0597715.
[14] Donthi, S. and Haggard, R.L. “A survey of dynamically reconfigurable FPGA devices.” System Theory, 2003. Proceedings of the 35th Southeastern Symposium on. 2003, 422–426.
[15] Ebeling, Carl, Cronquist, Darren C., and Franklin, Paul. “RaPiD - Reconfigurable Pipelined Datapath.” FPL ’96: Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers. London, UK: Springer-Verlag, 1996, 126–135.
[16] Eguro, Ken and Hauck, Scott. “Armada: timing-driven pipeline-aware routing for FPGAs.” FPGA ’06: Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays. New York, NY, USA: ACM, 2006, 169–178.
[17] Eldredge, J.G. and Hutchings, B.L. “Density enhancement of a neural network using FPGAs and run-time reconfiguration.” FPGAs for Custom Computing Machines, 1994. Proceedings. IEEE Workshop on. 1994, 180–188.
[18] Farrahi, A.H. and Sarrafzadeh, M. “Complexity of the lookup-table minimization problem for FPGA technology mapping.” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 13 (1994).11: 1319–1332.
[19] Foulk, P.W. “Data-folding in SRAM configurable FPGAs.” FPGAs for Custom Computing Machines, 1993. Proceedings. IEEE Workshop on. 1993, 163–171.
[20] Giri, A., Visvanathan, V., Nandy, S.K., and Ghoshal, S.K. “High speed digital filtering on SRAM-based FPGAs.” VLSI Design, 1994., Proceedings of the Seventh International Conference on. 1994, 229–232.
[21] Goslin, G. “Using Xilinx FPGAs to design custom digital signal processing devices.” 1995, 565–604.
[22] Grant, David, Wang, Chris, and Lemieux, Guy G.F. “A CAD framework for Malibu: an FPGA with time-multiplexed coarse-grained elements.” Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays. FPGA ’11. New York, NY, USA: ACM, 2011, 123–132.
[23] Gunther, B., Milne, G., and Narasimhan, L. “Assessing document relevance with run-time reconfigurable machines.” FPGAs for Custom Computing Machines, 1996. Proceedings. IEEE Symposium on. 1996, 10–17.
[24] Hammerquist, M. and Lysecky, R. “Design space exploration for application specific FPGAs in system-on-a-chip designs.” SOC ’08: Proceedings of the IEEE International SOC Conference. 2008, 279–282.
[25] Hayes, J.P. “A unified switching theory with applications to VLSI design.” Proceedings of the IEEE 70 (1982).10: 1140–1151.
[26] Kapre, Nachiket, Mehta, Nikil, deLorimier, Michael, Rubin, Raphael, Barnor, Henry, Wilson, Michael J., Wrighton, Michael, and DeHon, Andre. “Packet-Switched vs. Time-Multiplexed FPGA Overlay Networks.” Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines. 2006.
[27] Koch, Andreas. “Structured design implementation: a strategy for implementing regular datapaths on FPGAs.” FPGA ’96: Proceedings of the 1996 ACM fourth international symposium on Field-programmable gate arrays. New York, NY, USA: ACM, 1996, 151–157.
[28] Koehler, S., Stitt, G., and George, A.D. “Performance visualization and exploration for reconfigurable computing applications.” ????
[29] Landy, Aaron and Stitt, Greg. “A low-overhead interconnect architecture for virtual reconfigurable fabrics.” Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems. CASES ’12. New York, NY, USA: ACM, 2012, 111–120.
[30] Lavin, C., Padilla, M., Lamprecht, J., Lundrigan, P., Nelson, B., and Hutchings, B. “HMFlow: Accelerating FPGA Compilation with Hard Macros for Rapid Prototyping.” Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on. 2011, 117–124.
[31] Lemoine, E. and Merceron, D. “Run time reconfiguration of FPGA for scanning genomic databases.” FPGAs for Custom Computing Machines, 1995. Proceedings. IEEE Symposium on. 1995, 90–98.
[32] Lysaght, Patrick, Stockwood, Jon, Law, J., and Girma, D. “Artificial Neural Network Implementation on a Fine-Grained FPGA.” Proceedings of the 4th International Workshop on Field-Programmable Logic and Applications: Field-Programmable Logic, Architectures, Synthesis and Applications. FPL ’94. London, UK: Springer-Verlag, 1994, 421–432.
[33] Lysecky, Roman, Miller, Kris, Vahid, Frank, and Vissers, Kees. “Firm-core Virtual FPGA for Just-in-Time FPGA Compilation (abstract only).” Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays. FPGA ’05. New York, NY, USA: ACM, 2005, 271.
[34] Marshall, Alan, Stansfield, Tony, Kostarnov, Igor, Vuillemin, Jean, and Hutchings, Brad. “A reconfigurable arithmetic array for multimedia applications.” FPGA ’99: Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 1999, 135–143.
[36] McMurchie, Larry and Ebeling, Carl. “PathFinder: a negotiation-based performance-driven router for FPGAs.” FPGA ’95: Proceedings of the 1995 ACM Third International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 1995, 111–117.
[37] Mulpuri, Chandra and Hauck, Scott. “Runtime and quality tradeoffs in FPGA placement and routing.” FPGA ’01: Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 2001, 29–36.
[38] Murgai, R., Nishizaki, Y., Shenoy, N., Brayton, R.K., and Sangiovanni-Vincentelli, A. “Logic synthesis for programmable gate arrays.” Design Automation Conference, 1990. Proceedings., 27th ACM/IEEE. 1990, 620–625.
[39] Roth, J. Paul and Karp, R. M. “Minimization Over Boolean Graphs.” IBM Journal of Research and Development 6 (1962).2: 227–238.
[40] Sekanina, Lukas. Evolvable Systems: From Biology to Hardware, chap. Virtual Reconfigurable Circuits for Real-World Applications of Evolvable Hardware. Springer Berlin / Heidelberg, 2003, 116–137.
[41] Shukla, Sunil, Bergmann, Neil W., and Becker, Jurgen. “QUKU: A Two-Level Reconfigurable Architecture.” ISVLSI ’06: Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures. Washington, DC, USA: IEEE Computer Society, 2006, 109.
[42] Stitt, G. and Coole, J. “Intermediate Fabrics: Virtual Architectures for Near-Instant FPGA Compilation.” Embedded Systems Letters, IEEE 3 (2011).3: 81–84.
[43] Tsu, William, Macy, Kip, Joshi, Atul, Huang, Randy, Walker, Norman, Tung, Tony, Rowhani, Omid, George, Varghese, Wawrzynek, John, and DeHon, Andre. “HSRA: high-speed, hierarchical synchronous reconfigurable array.” FPGA ’99: Proceedings of the 1999 ACM/SIGDA seventh international symposium on Field programmable gate arrays. New York, NY, USA: ACM, 1999, 125–134.
[44] Villasenor, J., Schoner, B., Chia, Kang-Ngee, Zapata, C., Kim, Hea Joung, Jones, C., Lansing, S., and Mangione-Smith, B. “Configurable computing solutions for automatic target recognition.” FPGAs for Custom Computing Machines, 1996. Proceedings. IEEE Symposium on. 1996, 70–79.
[45] Wang, J., Chen, Q.S., and Lee, C.H. “Design and implementation of a virtual reconfigurable architecture for different applications of intrinsic evolvable hardware.” Computers & Digital Techniques, IET 2 (2008).5: 386–400.
[46] Wirthlin, Michael J. and Hutchings, Brad L. “Improving Functional Density Through Run-Time Constant Propagation.” In ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 1997, 86–92.
[47] Yiannacouras, Peter, Steffan, J. Gregory, and Rose, Jonathan. “VESPA: portable, scalable, and flexible FPGA-based vector processors.” Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems. CASES ’08. New York, NY, USA: ACM, 2008, 61–70.
BIOGRAPHICAL SKETCH
Aaron Landy received the Bachelor of Science in Electrical Engineering degree
from the University of Texas at Austin in 2011, with specialization in Computer
Architecture and Embedded Systems. While at the University of Texas, he worked
under Dr. Derek Chiou to implement a low-overhead in-situ debugging framework for
FPGA applications.
In 2011, he worked in Post-Silicon Validation for the Atom System-on-Chip at
Intel Corporation in Austin, Texas. He joined the NSF Center for High Performance
Reconfigurable Computing (CHREC) at the University of Florida as a Ph.D. student
and research assistant under Dr. Greg Stitt. Aaron received the Master of Science in
Electrical Engineering degree from the University of Florida in 2013.
His research interests include reconfigurable computing, computer architecture,
and embedded systems. His current work focuses on FPGA toolflows and productivity,
particularly fast place-and-route, high-level synthesis, and FPGA virtualization.