INTERMEDIATE FABRICS: LOW-OVERHEAD COARSE-GRAINED VIRTUAL RECONFIGURABLE FABRICS TO ENABLE FAST PLACE AND ROUTE

By

AARON LANDY

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2013
© 2013 Aaron Landy
ACKNOWLEDGMENTS
I thank the chair and members of my supervisory committee for their mentoring
and time, the University of Florida Graduate School, the National Science Foundation,
and the NSF Center for High Performance Reconfigurable Computing (CHREC) for their
generous support. I thank my parents for their many years of loving encouragement, and
as resources). Intermediate fabrics can also implement LUT-based architectures, but
instead are usually specialized for specific domains and even individual applications
using a resource granularity uncommon to FPGAs, which provides fast place-and-route.
Previous virtual FPGAs can be viewed as specific, low-level instances of an intermediate
fabric. One key difference is that because intermediate fabrics can be specialized,
interconnect requirements differ from fine-grained virtual FPGAs, and also vary
between specializations. Numerous previous studies have introduced reconfigurable,
coarse-grained physical devices for different application domains [5] [10] [15] [22]
[24] [34] [40] [41] [43]. Although those devices provide good performance for their
targeted applications, the disadvantage of such an approach is that specialized physical
devices generally have high costs due to limited economy of scale. Intermediate
fabrics can provide the same architectures implemented virtually atop common
commercial-off-the-shelf FPGAs, which has significant cost advantages and an
acceptable overhead for some use cases. Several studies have also considered virtual
coarse-grained architectures for specific domains [41] [45]. These approaches are
complementary and represent individual instances of intermediate fabrics.
2.3 Constant Propagation
Many studies have shown that constant propagation can increase functional
density and performance [12] [13] [19] [20] [21] [23] [31]. While those techniques
are effective, synthesis must be able to statically identify constants. The presented
work enables these optimizations in cases where a constant value is not known at
compile time, and also when a value changes with low frequency. Previous studies have
demonstrated a concept similar to pseudo-constants by using partial reconfiguration
for run-time logic minimization [17] [23] [31] [32] [44] [46]. Previous work also showed
that partial reconfiguration can have prohibitive reconfiguration times, implementation
complexity, and limitations on reconfiguration granularity [14] [35] [46]. This past work
examined trade-offs between area and reconfiguration time when using run-time logic
optimization, and included the development of a functional density metric to quantify the
advantages. We extend past work by reducing reconfiguration times and implementation
complexity via the LUT-based RAM primitives provided by most FPGAs. Prior studies
have also used LUT RAM as dynamically reconfigurable logic. The FPGA overlay
network presented by Brant et al. [7] used LUT RAM to implement virtual LUTs in a
virtual FPGA fabric. That work also decreased multiplexer resources via an approach
similar to what we describe. We expand upon that work by generalizing pseudo-constant
logic optimization for potentially any logic function.
2.4 Intermediate Fabrics
In [11], Coole and Stitt introduce intermediate fabrics as a possible solution to
exceedingly long FPGA place-and-route times. They also propose fabric specialization
to address area overhead concerns. Using specialization, fabric overhead can be
reduced by including in the fabric only those resources essential to implement a given
application. This represents the lowest overhead achievable by early intermediate
fabrics, but pays a significant penalty in fabric reusability. The optimizations presented in
this work offer alternative approaches to overhead reduction without sacrificing fabric
reusability.
CHAPTER 3
INTERCONNECT ENHANCEMENTS
This chapter discusses enhancements made to the intermediate fabric interconnect
architecture to reduce area overhead while minimizing routability tradeoffs. The chapter
first provides an overview of the intermediate fabric architecture and the interconnect
used by initial intermediate fabric studies. It then details the optimized interconnect style
and finally compares area overhead and routability between the original and optimized
interconnect.
3.1 Intermediate Fabric Architecture
This section overviews intermediate fabrics in Section 3.1.1 and then discusses the
virtual interconnect architecture used by previous intermediate fabrics in Section 3.1.2.
3.1.1 Overview
As shown in Figure 3-1, an intermediate fabric is a virtual reconfigurable device,
implemented atop a physical FPGA, which implements circuits from HDL or high-level
code via synthesis, placement, and routing. Intermediate fabrics, like overlay networks
[26] and virtual FPGAs [7][33], provide a fabric capable of implementing numerous
circuits. However, unlike those techniques, intermediate fabrics tend to be specialized
for the requirements of a specific set of applications, while providing enough routability
to support similar applications or different functions in the same domain. The example
in Figure 3-1 illustrates an intermediate fabric specialized for a frequency-domain
signal-processing circuit, and provides corresponding floating-point resources for
FFTs and arithmetic computation. When directly compiling this circuit to an FPGA,
place-and-route is likely to require hours due to the compiler decomposing the circuit
into tens-of-thousands of LUTs. However, when targeting the intermediate fabric,
the compiler decomposes the circuit into several coarse-grained resources, which
reduces the place-and-route input size by orders of magnitude and provides 100x to
1000x place-and-route speedup [11][42]. A complete discussion of intermediate fabric
[Figure 3-1 illustration: an application circuit with floating-point operations passes through synthesis, place, and route, drawing a fabric from a fabric library, and is implemented on an intermediate fabric (IF) with floating-point resources (FFTs, multipliers, adders, dividers) atop the FPGA. Benefits noted in the figure: 1) fast compilation via abstraction (a few coarse-grained resources as opposed to 100k LUTs); 2) circuit portability across physical FPGAs.]

Figure 3-1. Intermediate fabrics (IFs) are virtual application-specialized fabrics implemented atop FPGAs that hide physical device complexity to achieve fast place-and-route and application portability.
usage models and their implementations is outside the scope of this paper; we instead
summarize two basic models. The library model provides a large, pre-implemented
set of intermediate fabrics that a designer or synthesis tool can choose from based on
the requirements of the application. For the example in Figure 3-1, a designer or tool
could choose the selected fabric from one of many fabrics that provide different fabric
sizes, different combinations of resources, different precisions, etc. An alternative is the
synthesis model, during which the synthesis tool creates a specialized fabric based on
the application requirements. The advantage to the synthesis model is reduced area
overhead. However, the disadvantage is that the application designer must wait for
place-and-route to implement the intermediate fabric on the physical FPGA. Although
such place-and-route may require hours, the compilation time is amortized over the
lifetime of the fabric because the physical place-and-route is only needed once.
[Figure 3-2 illustration: (a) an island-style layout of computational units (CUs), switch boxes (SBs), and connection boxes (CBs); (b) a bidirectional routing track whose sources and sinks are the east and west switch boxes and the north and south CUs; (c) the track's RTL implementation as a multiplexer whose select is driven by configuration bits.]

Figure 3-2. Previous intermediate fabric interconnect architecture, where (b) routing tracks between resources were implemented as (c) multiplexers based on the number of track sources.
3.1.2 Previous Interconnect Architecture
Figure 3-2(a) illustrates the basic island-style fabric used in previous intermediate
fabrics [11][42]. Such a fabric closely imitates the widely studied structure of physical
FPGAs consisting of switch boxes, connection boxes, and bidirectional routing tracks,
but replaces LUTs with application-specific resources (e.g., floating-point units, FFTs)
referred to as computational units (CUs). Note that because intermediate fabrics
can be specialized, the CUs and virtual routing tracks can potentially be any width.
For example, a fabric with floating-point CUs might provide 32-bit routing tracks.
Intermediate fabrics also contain specialized regions for control and memory operations.
However, in this paper, we focus on the areas of a circuit that contribute the most to long
place-and-route, which for many applications are coarse-grained, pipelined datapath
operations (e.g., FFTs).
The main limitation of previous intermediate fabrics is area overhead incurred
by implementing the virtual fabric atop a physical FPGA (i.e., synthesized VHDL for
the virtual fabric). Such overhead results from several sources. The largest source
of overhead comes from mux logic in the virtual interconnect. Previous intermediate
fabrics use virtual bidirectional routing tracks [11][42], whose register-transfer-level
(RTL) implementation is shown in Figure 3-2(b) and (c). For an m-bit track with n
possible sources, the RTL implementation uses an m-bit, n:1 mux, in some cases with
a register or latch on the mux output. For example, Figure 3-2(b) shows a common
configuration of a bidirectional track with four sources: two switch boxes and two CUs,
with the corresponding RTL implementation shown in Figure 3-2(c) as a 4:1 mux, with
a select value stored in a 2-bit virtual configuration register. Considering the large
number of tracks found in most fabrics, this mux-based implementation of virtual tracks
uses numerous LUT resources in the physical FPGA, and is responsible for over 50%
of the total LUT usage in many intermediate fabrics. Similarly, virtual switch boxes
and connection boxes implement various topologies using additional muxes between
virtual tracks. The exact percentage of LUT usage for switch/connection boxes varies
depending on the box topology and flexibility, but is also a significant contributor to
area overhead. When combining all interconnect resources (tracks, switch boxes, and
connection boxes), we determined that the virtual interconnect is commonly responsible
for over 90% of LUT requirements. In addition to the mux overhead, intermediate fabrics
also require physical flip-flop resources for any storage. Virtual registers are technically
not overhead because synthesis tools can directly implement virtual registers on
physical flip-flops in the FPGA. However, virtual configuration flip-flops and any pipelined
interconnect is overhead because the resulting physical flip-flops would not be used by a
circuit directly targeting the FPGA.
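To make this overhead accounting concrete, the per-track cost described above can be restated as a small model. This is a rough sketch of the text's numbers, not output from the fabric generator, and the function name is ours:

```python
import math

def track_overhead(m_bits, n_sources):
    """Virtual-track cost model from Section 3.1.2: an m-bit n:1 mux plus a
    ceil(log2(n))-bit virtual configuration register holding the mux select."""
    config_bits = math.ceil(math.log2(n_sources))
    return {"mux_width": m_bits, "mux_inputs": n_sources,
            "config_bits": config_bits}

# The example from Figure 3-2(b)/(c): a track with four sources (two switch
# boxes, two CUs) needs a 4:1 mux and a 2-bit configuration register.
print(track_overhead(16, 4))
```

Multiplying this per-track cost by the number of tracks in a fabric is what makes the interconnect dominate LUT usage.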
3.2 Optimized Interconnect
Based on the significant overhead caused by the virtual interconnect described
in the previous section, in this paper we focus on virtual interconnect optimizations
to reduce muxes, with the goal of retaining high routability. During an initial attempt
at optimizing virtual tracks, we observed that the RTL implementation shown in
Figure 3-2(c) contains some redundancy that could potentially be removed. Specifically,
a physical track would never have a common source and sink, which results in an
unnecessary input to the mux. For example, a physical FPGA would never route a signal
out of a switch box and back into the same switch box using the same track. Therefore,
[Figure 3-3 illustration: (a) the four-endpoint track of Figure 3-2 re-implemented with one mux per destination (CU north/south inputs, switch box east/west sinks); (b) the two-source case, in which each per-destination "mux" degenerates to a directional wire from a component output to a component sink.]

Figure 3-3. (a) An optimized virtual-track implementation to reduce routing redundancy, which eliminates muxes when (b) tracks have two sources.
we can eliminate the redundant routes and replace the n:1 mux with n separate (n-1):1
muxes, where each mux defines one of the possible track destinations. Figure 3-3(a)
shows an example for the previous track in Figure 3-2(c), where n=4. Despite eliminating
routing redundancy, such an approach does not save area because in most cases, n
separate (n-1):1 muxes require more LUTs than a single n:1 mux.
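The area argument can be checked with quick arithmetic on total mux data inputs per track. This is a back-of-the-envelope sketch, not a LUT-accurate model:

```python
def track_mux_inputs(n):
    """Total mux data inputs to implement one virtual track with n endpoints:
    a single n:1 mux (Figure 3-2(c)) versus n separate (n-1):1 muxes, one per
    destination (Figure 3-3(a))."""
    single = n                # one n:1 mux
    per_dest = n * (n - 1)    # n muxes of (n-1) inputs each
    return single, per_dest

# For n > 2 the per-destination style needs strictly more inputs, so it
# costs more LUTs. For n = 2, each "(n-1):1 mux" is a 1:1 mux -- a wire --
# so the per-destination style needs no LUTs at all.
for n in (2, 3, 4):
    print(n, track_mux_inputs(n))
```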
However, we have observed there is a special case where the track implementation
in Figure 3-3(a) can achieve reduced area. For any virtual track with exactly two
possible sources, this implementation simplifies into two directional wires as shown
in Figure 3-3(b). In other words, a 2-source virtual track requires two separate 1:1
muxes, but a 1:1 mux is just a wire. Therefore, by using only 2-source virtual tracks
throughout the entire intermediate fabric, we can potentially replace all mux logic and
wires in Figure 3-3(a) with two wires for each track. Such an optimization has significant
potential due to virtual tracks contributing to over 50% of area overhead. Furthermore,
this optimization saves a significant amount of wires per track, while simultaneously
improving routability by enabling routing in two directions. An additional advantage
[Figure 3-4 illustration: a grid of switch boxes with each CU's input and output connected directly to its adjacent switch boxes.]

Figure 3-4. Layout of intermediate fabric using optimized interconnect with CU I/O connected directly to adjacent switch boxes.
is that by reducing muxes, the fabric requires fewer configuration registers to store the
corresponding select values, which reduces flip-flop overhead while also improving
reconfiguration times. Although using 2-source virtual tracks reduces area, replacing the
3- and 4-source tracks used in previous fabrics is a significant challenge. In a traditional
island-style architecture, a track typically has 3-4 possible sources: 2 switch boxes and
1-2 CUs. If we eliminate the switch box connections, the track can only route between
adjacent resources, which significantly limits routability. Similarly, if we remove the CU
connections, then there is no way for routing to reach CUs.
To address this problem, we considered several significant modifications to
traditional fabrics. First, we started with 2-source tracks between adjacent switch boxes,
with each switch box as a possible source. However, that interconnect configuration
does not provide a mechanism for connecting CUs to the routing tracks. We could have
[Figure 3-5 illustration: (a) the previous planar switch box, where each of the north, east, south, and west outputs selects among the other three channels; (b) the presented switch box, which adds diagonal (NE, NW, SE, SW) channels for direct CU connections, with registered outputs.]

Figure 3-5. Switch box topologies for (a) previous intermediate fabric interconnect and (b) the presented interconnect with diagonal CU channels.
added connection boxes, but that would violate the 2-source restriction. Therefore,
we considered adding additional channels to each switch box with direct connections
to the CU I/O. The overall fabric layout for this optimized virtual interconnect is shown
in Figure 3-4. As illustrated, in this unconventional fabric, no virtual track has more
than 2 sources, which eliminates all muxes previously needed to implement tracks.
One challenge in designing this optimized interconnect is that although we eliminated
track muxes, we added additional muxes inside of the switch boxes to support the
additional CU channels. Unless the switch boxes add fewer muxes than we removed
from the tracks, this optimization does not reduce area. To ensure that the optimized
interconnect reduces LUT usage, we exploit the internal characteristics of the switch box
to handle the additional routing requirements with minimal logic. Previous intermediate
fabric switch boxes use a planar topology, where each output from the switch box uses
a 3:1 mux that selects an input from one of the three other channels, as shown in
Figure 3-5(a). For the new interconnect, these multiplexers could potentially require
four more inputs to handle routing of the four adjacent CUs, which would significantly
outweigh track savings. However, we can exploit the fact that increasing mux inputs
[Figure 3-6 plot: number of 4-input LUTs (log2 scale, 8 to 512) versus number of mux inputs (2 to 8) for 16-bit, 32-bit, and 64-bit data widths.]

Figure 3-6. Virtex 4 LX100 multiplexer LUT usage for varying mux input counts. The plateaus provide opportunities for switch boxes to add more connections without an area penalty.
does not always increase LUT requirements. As shown in Figure 3-6, FPGAs have
different area plateaus where additional mux inputs have the same LUT requirements as
fewer inputs (e.g., 3-4 inputs and 6-8 inputs). The optimized interconnect exploits this
characteristic by adding CU I/O connections to the muxes until reaching the largest input
size of a plateau, which maximizes routability without any increase in area. Interestingly,
the presented interconnect can be specialized for different physical FPGAs, which have
different mux plateaus due to varying LUT sizes.
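One way to reason about these plateaus is with a lookup table read off Figure 3-6. The per-bit LUT counts below are our illustrative reading of the Virtex 4 plot, not datasheet values:

```python
# Illustrative model of the Virtex 4 plateaus in Figure 3-6: 4-input LUTs
# per output bit of an n:1 mux. Note the plateaus at 3-4 inputs and at
# 6-8 inputs; treat these values as assumptions read from the plot.
LUTS_PER_BIT = {2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 4, 8: 4}

def mux_luts(n_inputs, data_width):
    """Estimated LUT cost of an n:1 mux over a data_width-bit channel."""
    return LUTS_PER_BIT[n_inputs] * data_width

# Widening a 3:1 switch-box mux to 4:1 is free under this model, while
# widening 5:1 to 6:1 is not -- matching Section 3.2's choice to stop at
# 4- and 5-input muxes.
print(mux_luts(3, 16), mux_luts(4, 16), mux_luts(5, 16), mux_luts(6, 16))
```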
Although the optimized interconnect switch boxes are not restricted to a specific
topology, we choose a planar-like topology for evaluation and target the mux plateaus for
4-input muxes. Therefore, the switch boxes increase 3-input muxes to 4 inputs wherever
possible. The switch boxes also use 5-input muxes, but do not increase the inputs to
6 or more, despite the plateau between 6 and 8 inputs. Increasing the mux inputs to 8
may improve routability with additional overhead, but we defer such analysis to future
work. An example topology is shown in Figure 3-5(b), where the switch box provides
a planar topology for the north, east, south, and west channels, which correspond to
virtual tracks. In this example, the CU channels (southeast, southwest, northwest,
northeast) connect to the other channels in customizable ways. Note that we are not
proposing a specific switch box topology for the optimized interconnect. Instead, like
any intermediate fabric, we expect the topology to change based on application and
routability requirements. For the applications we evaluated, using a highly directional
fabric was beneficial due to pipelined, feed-forward datapaths. However, the switch
box can easily be customized for other topologies. In the experiments, we use a fabric
generation tool that allows specification of the exact switch box topology in a fabric
description file.
3.3 Experiments
In this section, we compare intermediate fabrics using the presented virtual
interconnect with previous work [11][42]. Section 3.3.1 describes the experimental
setup. Section 3.3.5 compares area requirements, clock speedups, and routability
of both approaches for unspecialized, uniform fabrics. Section 3.3.6 presents similar
experiments for application-specialized fabrics.
3.3.1 Experimental Setup
This section describes the intermediate fabric tool flow used for the experiments
(Section 3.3.2), along with the routability measurements (Section 3.3.3), and the tools
used for evaluating the different interconnects (Section 3.3.4).
3.3.2 Tool Flow
To implement applications on the intermediate fabrics, we manually synthesize
circuits by creating technology-mapped netlists. We plan to convert open-source
synthesis tools to target intermediate fabrics, including OpenCL high-level synthesis,
but such a project is outside the scope of this paper. For place-and-route, we use the
algorithm previously described in [11] to ensure that the comparison between the new
and previous interconnect is not unfairly skewed by improved placement. In fact, the
place-and-route results for the new interconnect are likely pessimistic because we
did not modify the placer cost function for the new interconnect. The place-and-route
algorithm is a variation of VPR [6], and uses simulated annealing for placement with a
cost function that minimizes bounding box size. Routing uses the well-known PathFinder
[36] negotiated-congestion algorithm. Both the new and previous interconnect have
varying amounts of pipelining in switch boxes or on tracks. Instead of using pipelined
routing algorithms (e.g., [16]), both approaches use realignment registers in front of each
CU to balance the routing delays of all inputs. Because this pipelining strategy only
works for pipelined datapaths that can be retimed without affecting correctness, we limit
the evaluation to fabrics with coarse-grained resources commonly needed by datapaths
in signal processing. To configure the intermediate fabric for different applications, the
place-and-route tool outputs a configuration bitfile that we store in a block RAM on the
targeted FPGA. Each intermediate fabric includes a programmer which loads the bitfile
from the block RAM by shifting bits into virtual configuration registers that control the
CUs and virtual switch boxes.
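The placement cost function described above, a VPR-style bounding-box minimization, can be sketched as follows. The data structures are hypothetical stand-ins for the tool's internal representation:

```python
def bounding_box_cost(netlist, placement):
    """VPR-style placement cost: the sum of each net's bounding-box
    half-perimeter. `netlist` maps a net name to the cells it connects;
    `placement` maps a cell to its (row, col) position in the fabric grid.
    A sketch of the cost function described in Section 3.3.2, not the
    thesis tool itself."""
    cost = 0
    for cells in netlist.values():
        rows = [placement[c][0] for c in cells]
        cols = [placement[c][1] for c in cells]
        cost += (max(rows) - min(rows)) + (max(cols) - min(cols))
    return cost

# Hypothetical two-net example: simulated annealing would accept or reject
# random cell swaps based on the change in this cost.
netlist = {"n1": ["mul0", "add0"], "n2": ["add0", "out0"]}
placement = {"mul0": (0, 0), "add0": (1, 2), "out0": (3, 2)}
print(bounding_box_cost(netlist, placement))  # → 5
```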
3.3.3 Routability Metric
To fairly compare tradeoffs between interconnects, it is necessary to measure
routability. To perform these measurements for a given intermediate fabric, we
place-and-route a large number of randomly generated netlists of varying sizes, and
determine the routability score of the interconnect based on the percentage of netlists
that route successfully. Due to the fast place-and-route time for intermediate fabrics, we
were able to test 1,000 netlists for each fabric to obtain a high-precision metric. The
random netlist generator creates directed acyclic graph structures representative of
pipelined datapaths. Based on the CU composition of each individual fabric tested, the
generator creates a random number of datapath stages, each consisting of a random
number of technology-mapped cells, and creates random connections between each
stage. Each stage contains at minimum enough cells, and enough connections are
made between stages, such that each cell has at least one path to the next stage. This
method results in netlists containing one or more disjoint pipelines of one or more stages
each.
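A minimal sketch of this routability measurement, with illustrative names and a pluggable `route` callback standing in for the actual PathFinder-based router:

```python
import random

def random_pipeline_netlist(stage_count, min_cells, max_cells):
    """Sketch of the random netlist generator in Section 3.3.3: a DAG of
    pipeline stages, each with a random number of technology-mapped cells,
    where every cell gets at least one edge to the next stage."""
    stages = [[f"s{s}c{i}"
               for i in range(random.randint(min_cells, max_cells))]
              for s in range(stage_count)]
    edges = []
    for src_stage, dst_stage in zip(stages, stages[1:]):
        for cell in src_stage:
            edges.append((cell, random.choice(dst_stage)))
    return stages, edges

def routability(fabric, netlists, route):
    """Routability score: the fraction of netlists that `route` successfully
    places and routes on the fabric."""
    routed = sum(1 for n in netlists if route(fabric, n))
    return routed / len(netlists)
```

In the experiments, 1,000 such netlists per fabric keep the score's sampling noise small.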
3.3.4 Interconnect Evaluation
To evaluate different interconnects, we developed a tool capable of generating
VHDL for intermediate fabrics using the new interconnect. The tool takes as inputs a
fabric-description file that defines the parameters of the fabric, such as size, aspect ratio,
bit-width and the makeup of the fabric, including CU composition, and row and column
channel descriptions. Channel descriptions include number of tracks, direction of each
track, and switch box topology.
To obtain physical FPGA utilization and timing results, we synthesized the
intermediate fabric VHDL using Xilinx ISE 10.1, Synopsys Synplify Pro 2012, and
Altera Quartus II 10.1, depending on the targeted FPGA. To evaluate the effects of
FPGA variation on each virtual interconnect, we implemented intermediate fabrics on
Xilinx Virtex 4 LX100 and LX200, Xilinx Virtex 5 LX330, and Altera Stratix IV E530
FPGAs. The intermediate fabric HDL synthesized for each test case uses the fixed-logic
multipliers available on each physical device for all CUs (Xilinx DSP48s and Altera
18x18 Multipliers); therefore all device utilization represents the LUT and flip-flop
overhead of implementing the target application via an intermediate fabric rather than a
direct HDL implementation.
3.3.5 Interconnect Comparison for Uniform Intermediate Fabrics
In this section we compare area, routability, and maximum clock speed of
intermediate fabrics using the presented interconnect to intermediate fabrics using
interconnect previously presented in [11] and [42]. We evaluate each interconnect
using different fabric sizes, implemented on several different physical FPGAs. Although
intermediate fabrics can be specialized to an application, in this section we evaluate
fabrics independently of targeted applications by using a uniform fabric consisting of
16-bit DSP CUs with various dimensions (e.g., 5x5 = 5 rows and 5 columns of I/O and
Table 3-1. A comparison between the presented virtual interconnect (New) and previous uniform virtual interconnect (Prev).
CUs). Table 3-1 compares LUT and flip-flop utilization (as a % of total device resources),
routability of 1000 randomly generated netlists, and maximum clock speed for identical
intermediate fabrics using the new and previous interconnects. We implemented fabric
sizes between 3x3 and 12x8 on a Virtex 4 LX200, where an NxM fabric is composed
of one row of M inputs, N-2 rows of M CUs, and one row of M outputs. We evaluated
larger fabric sizes of 13x13 and 14x14 on a Virtex 5 LX330, and sizes 15x15 and 16x16
on a large Stratix IV E530. For fabrics using the previous interconnect, we used 3
16-bit tracks per channel with specialized connection boxes from [11], as previous work
indicated this configuration to be an effective tradeoff between routability and overhead.
For fabrics using the new interconnect, we used 2 16-bit tracks per row and 4 tracks
per column with the switch box topology described in Section 3.2 optimized for 4-input
muxes.
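For reference, the NxM convention above implies the following resource counts (a trivial helper; the function name is ours):

```python
def fabric_composition(n_rows, m_cols):
    """NxM fabric per Section 3.3.5: one row of M inputs, N-2 rows of
    M CUs each, and one row of M outputs."""
    return {"inputs": m_cols,
            "cus": (n_rows - 2) * m_cols,
            "outputs": m_cols}

# The largest fabric evaluated on the Virtex 4 LX200.
print(fabric_composition(12, 8))
```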
These results show the LUT and flip-flop utilizations of the new interconnect are
significantly less than the previous interconnect, with an average LUT savings of 54%
and flip-flop savings of 59% for the fabrics evaluated. Note that we were unable to
synthesize the old interconnect on the Stratix IV device. We tried three different versions
of Quartus, but the old interconnect would cause a crash during the retiming stage
of synthesis. For this reason, we exclude the Stratix IV results from the averages.
Additionally, the new interconnect showed significant maximum clock frequency speedup
for larger fabrics. When implemented on the Virtex 4, new interconnect clock speeds
decreased only 6.3% between fabrics of size 3x3 to 12x8, whereas the previous
interconnect suffered from a 34.7% decrease in clock speed over the same range.
Overall, the new interconnect averaged 167 MHz compared to 136 MHz. The new
interconnect did incur a routability penalty, with an average decrease of 16% compared
to the previous interconnect. While this overhead is a potential limitation of the new
interconnect, especially when applied to a general-purpose fabric, we believe this
overhead to be an acceptable tradeoff when compared to the significant area savings
provided by the new interconnect. Routability overhead can also be easily compensated
for when designing the CU composition of a fabric. Because the placer algorithm used
in these experiments is unchanged from that used for the old fabric, it is likely that an
appropriately customized placer cost function would significantly improve the routability
of the new interconnect. Similarly, fabrics using the new interconnect could account for
decreased routability by including many more routing resources while still saving area.
Routability decreased monotonically with increased fabric size due to the increased
difficulty of routing larger netlists. The one exception was the 3x3 fabric with the new
interconnect, which had lower routability than the larger fabrics. We identified the source
of this problem as limited connections between I/O and CUs for very small fabrics using
the new interconnect. Because we expect 3x3 to be an unusually small size for actual
usage, this overhead is not a significant limitation. These results also show decreased
LUT overhead savings of only 46% in fabrics implemented on the Virtex 5 device. This
smaller improvement is likely due to the different CLB configuration used by that device,
with slightly altered mux-area plateau characteristics, whereas the optimizations used by
the evaluated interconnect were optimized for 4-input muxes. Despite being optimized
for a different LUT configuration, the new interconnect still had significant savings.
Flip-flop usage on the Altera device was significantly higher than both Xilinx devices,
which resulted from the Xilinx FPGAs implementing the realignment registers as SRL16
primitives, in contrast to the Altera FPGA which used flip-flops. As future work, we
will investigate optimizations for Altera FPGAs. One additional advantage of reducing
muxes throughout the interconnect is the corresponding elimination of configuration
registers to store the select values. Fewer registers reduce flip-flop usage, as shown in Table 3-1, and also reduce configuration bitfile size, which correspondingly
reduces configuration times and block RAM overhead of the fabric. For the examples in
this section, the new interconnect improved configuration times by an average of 55%
compared to the previous interconnect.
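The bitfile-size effect can be estimated from the select-register widths alone. The mux mixes below are made-up illustrations, not measured fabric contents:

```python
import math

def fabric_config_bits(mux_input_counts):
    """Interconnect configuration bits for a fabric: each k:1 virtual mux
    needs a ceil(log2(k))-bit select register (Section 3.1.2). A sketch for
    reasoning about bitfile size, not the actual configuration format."""
    return sum(math.ceil(math.log2(k)) for k in mux_input_counts)

# Hypothetical mux mixes before and after eliminating track muxes. Because
# the programmer shifts the bitfile serially from block RAM, fewer bits
# also means proportionally faster reconfiguration.
old_bits = fabric_config_bits([4] * 300 + [3] * 200)
new_bits = fabric_config_bits([4] * 120 + [5] * 80)
print(old_bits, new_bits)
```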
3.3.6 Interconnect Comparison for Specialized Intermediate Fabrics
One advantage of intermediate fabrics is that a designer or tool can specialize the
architecture and interconnect for a given domain or even an individual application. In
this section, we compare intermediate fabrics using application-specialized interconnect
presented in [11] with the new interconnect. To enable a fair comparison, we evaluate
the same application circuits from [11] using the same specialized fabrics as previous
experiments. Specialization used in the previous experiments included varying fabric
sizes and non-uniform interconnects. For the new interconnect, we limit specialization
to fabric sizes, making the results pessimistic. For all specialized fabrics, we used the
smallest fabric and interconnect that could successfully route the target application
netlist. For these experiments, the physical FPGA is a Virtex 4 LX100, which we
chose to match the previous experiments. To perform the comparison, we used the
twelve applications from [11], seven of which were implemented using both 16-bit fixed
point arithmetic and 32-bit floating-point arithmetic, indicated with an FXD or FLT suffix
respectively. All track widths matched the CU widths. All circuits without a suffix used
16-bit fixed-point CUs. We briefly summarize the previous applications as follows.
Matrix multiply performs the kernel of a matrix multiplication, calculating the inner
product of two 8-element vectors using 7 adders and 8 multipliers. FIR implements
a 12-tap finite impulse response filter in transpose form with symmetric coefficients
using 11 adders and 12 multipliers. N-body, representing the kernel of an N-body
simulation, calculates the gravitational force exerted on a particle due to other particles
in two-dimensional space using 13 adders, multipliers, and a divider. Accum monitors a
stream, counting the number of times the value is less than a threshold. It is the smallest
netlist, consisting of 4 comparators and 3 adders. Normalize normalizes an input stream
using 8 multipliers and 8 adders. Bilinear performs bilinear interpolation on an image,
requiring 8 multipliers and 3 adders. Floyd-Steinberg performs image dithering using
6 adders and 4 multipliers. Thresholding performs automatic image thresholding using
8 comparators and 14 adders. Sobel uses a 3x3 convolution to perform Sobel edge
detection with 2 multipliers and 11 adders. Gaussian blur uses a 5x5 convolution to
perform noise reduction using 25 multipliers and 24 adders. Max filter performs a 3x3
sliding-window image filter with 8 comparators. Mean filter similarly calculates the
average of a sliding window, which we vary from 3x3 to 7x7, requiring a maximum of 48
adders and 1 multiplier. Table 3-2 compares the interconnects for each case study. The
first major column, Place-and-Route Time, compares place-and-route execution times
for an intermediate fabric with the previous interconnect (IF Prev), an intermediate fabric
with the new interconnect (IF New), and when synthesizing VHDL for each example
directly to the FPGA. The table also shows the resulting place-and-route speedup for the
new and previous interconnects. The results show comparable place-and-route times
for both the old and new interconnect. However, because the previous interconnect
already achieves a place-and-route speedup of 554x compared to an FPGA, the further
improvement by the new interconnect provided a 1350x place-and-route speedup.
The place-and-route speedup was larger for the floating-point examples due to longer
place-and-route times for the physical FPGA. Furthermore, these place-and-route
speedups are highly pessimistic because the specialized examples from [11] do not
include common board logic such as PCIe and memory controllers. Other studies
have shown that including these controllers with tight timing constraints can add up to
Table 3-2. A comparison between intermediate fabrics (IFs) with the presented virtual interconnect (IF New) and the previous application-specialized interconnect (IF Prev). The table's major columns are Place-and-Route Time, Area and Routability, and Clock Speed.
20 minutes to FPGA place-and-route time, but have no effect on intermediate fabric
place-and-route time [42].
The second major column in Table 3-2 reports area savings of the new interconnect
in terms of FPGA LUTs and flip-flops, along with the routability overhead incurred to
achieve these savings. On average, the new interconnect significantly reduced LUT
usage by 48% and flip-flop usage by 46%, despite the significant specialization by
the previous fabrics. On average, routability slightly improved by 8% with the new
interconnect. However, this average is skewed by three outliers, normalize, Gaussian,
and mean7x7, which had very low routability due to significant specialization in the
previous fabrics. Excluding these outliers, the new interconnect had a 2% routability
overhead. The smaller routability overhead compared to the previous section is due
to the specialized versions of the previous interconnect, which used just enough
routing resources to route the targeted application, and therefore lowered general
routability. The final column of Table 3-2 compares the maximum clock speed of
the specialized fabrics using both the new and old interconnect. For specialized
fabrics, these experiments show a negligible average impact on clock speed, with
both interconnects showing an average clock frequency of 186 MHz. However, there
was significant variation as high as 21% between specialized fabrics. It should be
noted that these results are contrary to the results for larger fabrics presented in the
previous section, which showed a clear trend of faster clock speeds for larger fabrics
using the new interconnect. The smaller clock improvement compared to the previous
section is due to the higher specialization of the previous interconnect, as opposed to
using a uniform interconnect.
CHAPTER 4
PSEUDO-CONSTANT LOGIC OPTIMIZATION
This chapter discusses a bottom-up approach to reducing intermediate
fabric interconnect area overhead. Specifically, we seek to reduce the area consumed
by each of the many multiplexers that compose the interconnect. The optimizations
presented below are complementary to the top-down architectural approach of the
previous chapter, which sought to limit the number of multiplexers used in the fabric.
FPGA logic optimization is a widely studied topic, with dozens of existing optimizations
that build upon decades of digital-design research [3][7][9][18][25][38][39]. A common
strategy involves iteratively propagating constants while performing logic minimization
(i.e., constant folding [19]). For example, Figure 4-1(a) shows a 4:1 multiplexer, which
a synthesis tool may map to three 4-input lookup tables (LUTs). In some situations,
as shown in Figure 4-1(b), a constant may propagate to the mux's select input, which
simplifies the logic to a wire.
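As a minimal illustration of this folding step, consider a sketch in Python rather than a synthesis tool's internal representation (the netlist encoding here is invented for clarity): a 4:1 mux node whose 2-bit select is fully constant collapses to a wire.

```python
# Illustrative sketch (not from a real synthesis tool): constant folding applied
# to a 4:1 mux node. Select bits are either a known constant (0/1) or None for
# unknown/runtime signals.

def fold_mux(inputs, select):
    """Return ('wire', input) if the 2-bit select is fully constant, else keep the mux."""
    s0, s1 = select
    if s0 is not None and s1 is not None:
        index = (s1 << 1) | s0          # select value, e.g. "01" -> 1
        return ("wire", inputs[index])  # mux simplifies to a wire
    return ("mux", inputs, select)      # cannot fold; keep the 3-LUT mux

# Select constant "01" (s1=0, s0=1): the mux collapses to input i1.
print(fold_mux(["i0", "i1", "i2", "i3"], (1, 0)))  # ('wire', 'i1')
```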
Unfortunately, constant-based optimizations have limited applicability. For example,
circuit designers often avoid constant inputs to enable support for as many use cases as
possible [46]. However, we have observed that circuits commonly include signals that
exhibit near-constant behavior, where the signal value rarely changes, which we define
as pseudo-constant. For example, many signal-processing applications initially set a
pseudo-constant convolution kernel, which remains the same for the duration of the
application. Alternatively, each frame of a low frame-rate video may also be considered
pseudo-constant. These pseudo-constant values are often inputs to common logic
components such as adders, multipliers, comparators, and muxes (e.g., [7][29]), which
could potentially benefit from constant folding to reduce area and/or increase replication.
We introduce pseudo-constant logic optimization, which is conceptually similar
to traditional constant folding, widely used in static logic optimization. However,
when a pseudo-constant changes values at runtime, the optimized logic becomes
Figure 4-1. A comparison of constant propagation for a multiplexer with (a) a non-constant select (requires 3 LUTs but supports all inputs), (b) a constant select (requires 0 LUTs but a statically known constant), and (c) pseudo-constant logic optimization for inputs that rarely change, using either LUT RAM (requires 1 LUT; must be reconfigured on input change, fast) or partial reconfiguration (requires 0 LUTs; must be reconfigured on input change, slow).
invalid. To prevent these invalidations from affecting correctness, we exploit FPGA
lookup-table (LUT) reconfigurability to dynamically modify the logic according to
the new pseudo-constant value. Although LUT reconfiguration causes performance
overhead, low-frequency invalidations often make this overhead insignificant. In general,
higher invalidation frequencies provide various tradeoffs between area savings and
performance overhead.
This chapter discusses the design process, implementation, and evaluation of
pseudo-constant logic optimization.
4.1 Pseudo-Constant Design Process
This section defines the four components of pseudo-constant logic optimization:
identification, technology mapping, bitfile creation, and invalidation detection.
4.1.1 Pseudo-Constant Identification
The first step of pseudo-constant logic optimization is the identification of potential
pseudo-constants. Our current approach uses designer-specified identification,
where designers use knowledge of an application's behavior to manually specify
pseudo-constant signals. Rather than requiring the actual value of a pseudo-constant,
a designer or synthesis tool need only know that a signal will be pseudo-constant (e.g.,
a convolution kernel). Although designers may often be aware of pseudo-constants,
there will be situations where potential pseudo-constants are not obvious. Synthesis
tools could potentially use a profiling-based heuristic that profiles the number of distinct
values of a given signal, along with the frequency that the value changes (i.e., the
invalidation frequency). Furthermore, we envision a hybrid approach where designers
specify the signals to profile. Previous work [28] has introduced such profiling for both
simulation and in-circuit behavior. We plan to investigate automatic pseudo-constant
identification as future work.
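Such a profiling heuristic could be sketched as follows (a hypothetical illustration, not the toolflow of [28]; the function name and trace are invented for clarity): given a recorded trace of a signal's values, report the number of distinct values and the invalidation frequency.

```python
# Hypothetical profiling heuristic: count distinct values of a signal trace and
# the invalidation frequency (fraction of samples at which the value changes).

def profile_signal(trace):
    distinct = len(set(trace))
    changes = sum(1 for prev, cur in zip(trace, trace[1:]) if cur != prev)
    invalidation_freq = changes / max(len(trace) - 1, 1)
    return distinct, invalidation_freq

# A convolution-kernel-like signal: set once, then held for the whole run.
trace = [0] + [7] * 999
distinct, freq = profile_signal(trace)
print(distinct, freq)  # 2 distinct values, ~0.1% invalidation frequency
```

A signal with few distinct values and a low invalidation frequency would be a strong pseudo-constant candidate.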
4.1.2 Pseudo-Constant Technology Mapping
One important difference between pseudo-constant and traditional logic optimization
is that the elaborated circuits may be identical, but may differ significantly after
technology mapping. For example, consider the 4:1 multiplexer from Figure 4-1, with a
constant or pseudo-constant select input. For this example, logic optimizations would
replace the multiplexer with a wire connected to the mux input that corresponds to
the select's constant/pseudo-constant value. However, depending on the available
FPGA primitives, technology mapping for the pseudo-constant logic may require more
than a wire because the resulting circuit must handle changes caused by invalidated
pseudo-constants. To deal with these invalidations, any logic that is optimized in the
previous step is marked as pseudo-constant logic, which technology mapping handles
differently from normal logic. Technology mapping for pseudo-constant logic is similar to
traditional technology mapping, but is restricted to FPGA primitives that support runtime
reconfiguration. Although there could potentially be numerous primitives, in this paper
we focus on common primitives in existing FPGA devices: LUT RAM and LUT shift
registers. Section 4.2 describes example mappings for Xilinx devices. Pseudo-constants
are also possible on Altera devices using MLABs, but evaluation of MLABs is outside
the scope of this paper. Previous work has focused on using partial reconfiguration for
similar goals [46], which we omit from this study due to the long reconfiguration times
compared to rewriting LUT contents. However, partial reconfiguration may represent a
Pareto-optimal tradeoff in terms of area savings and performance overhead. We plan to
investigate these tradeoffs as future work.
4.1.3 Pseudo-Constant Bitfile Creation
After technology mapping, the resulting circuit must create and/or provide a small,
corresponding bitfile that implements the logic for each pseudo-constant value. In the
case of LUT RAM or LUT shift registers, this bitfile is simply the truth table stored in
the LUT. For the mux example in Figure 4-1(c), the bitfile for the pseudo-constant mux
is 16 bits, due to the 4-input LUT (2^4 bits). The overhead of pseudo-constant bitfile
creation depends on the characteristics of a particular pseudo-constant, which may be
amenable to either offline or online creation. In this paper, we focus mainly on offline
creation, but discuss the tradeoffs and challenges of online creation to present all
possible use cases. Offline creation is possible when the designer or synthesis tool
is aware that a pseudo-constant only has several possible values. In this case, the
synthesis tool can pre-compute the bitfile for all possible values and store the bitfiles
in on-chip memory, which the circuit loads into the corresponding primitives during
a pseudo-constant invalidation. For example, for a 4:1 mux with a pseudo-constant
select, a synthesis tool could statically determine four separate bitfiles and store them
in a block RAM or other memory. Offline creation is not limited to functions with small
numbers of inputs. For example, an input to a 32-bit comparator may only have two
different possible values for a given application (e.g., a runtime-specified threshold in
an image-processing application), which would enable a synthesis tool to statically
create two separate bitfiles. Online bitfile creation is needed when a synthesis tool is
not aware of the different possible values of a pseudo-constant, or alternatively when
there are too many possible values, which would require a significant amount of on-chip
memory to store bitfiles. In general, online bitfile creation is more complicated and
requires a portion of the circuit, or a co-processor, to calculate truth tables for invalidated
pseudo-constant logic. In many situations, online creation is not practical because the
logic required for bitfile creation is larger than the savings from the pseudo-constant
logic. Note that pseudo-constant bitfiles also create memory overhead. We therefore
expect pseudo-constant optimization to be appropriate where block RAM is not the main
resource bottleneck of an application.
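For the mux example above, offline creation amounts to tabulating one truth table per possible select value. The following sketch assumes a simplified bit ordering; a real Xilinx LUT INIT value uses a device-specific encoding.

```python
# A sketch of offline bitfile creation for a 4:1 mux with a pseudo-constant
# select: each possible select value yields a 16-entry truth table over the
# four data inputs i0..i3 (bit ordering here is an assumption for illustration).

def mux_bitfile(sel):
    """16-bit truth table whose output equals data input number `sel`."""
    bits = 0
    for addr in range(16):          # addr bit k holds the value of input ik
        out = (addr >> sel) & 1
        bits |= out << addr
    return bits

# Precompute one bitfile per possible select value and store them in memory
# (on the FPGA, this table would live in a block RAM).
bitfiles = {sel: mux_bitfile(sel) for sel in range(4)}
print([f"{b:04x}" for b in bitfiles.values()])  # ['aaaa', 'cccc', 'f0f0', 'ff00']
```

On an invalidation, the circuit would simply shift the 16-bit entry for the new select value into the LUT.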
4.1.4 Pseudo-Constant Invalidation Detection
Pseudo-constant circuits must identify when a pseudo-constant changes values,
which we refer to as invalidation detection. After detecting an invalidation, the circuit
loads a new bitfile into the corresponding resources. In this paper, we use application-specified
detection, where the designer explicitly specifies when a given pseudo-constant
changes. One disadvantage is that this approach is error prone and requires knowledge
of pseudo-constant invalidations. However, for many applications, invalidations are
obvious. For example, for designer-specified pseudo-constants, the designer is already
aware of pseudo-constants, and is likely aware of when the application changes a
pseudo-constant (e.g., a new image). As future work, we envision the possibility of
runtime detection, which doesn't require designer knowledge, but is often impractical due
to overhead. In the general case, runtime detection requires a comparator, which may
outweigh savings except for large regions of pseudo-constant logic (e.g., large adder
trees).
Figure 4-2. Functional architecture of a Xilinx Virtex 5 LUT. Each LUT can be configured as a 64x1 dual-ported RAM, a single variable-length shift register up to 32 bits long, or two independent variable-length shift registers up to 16 bits long each.
4.2 Technology Mapping
In this section, we discuss how different pseudo-constant logic can be technology
mapped onto FPGA LUT primitives. In Section 4.2.1, we present pseudo-constant
primitives for the Xilinx Virtex 5. In Section 4.2.2, we identify architectural bottlenecks
and present extensions that would enable Virtex 5 to better support pseudo-constants.
4.2.1 Pseudo-Constant Primitives for Xilinx Virtex 5
General-purpose logic resources in Xilinx Virtex 5 devices are composed of
columns of configurable logic blocks (CLBs). Each CLB is composed of two SLICEs,
each of which contains four LUTs. While devices are composed of equal numbers
of two different SLICE types, SLICEM and SLICEL, only SLICEMs have dynamically
reconfigurable LUT primitives; therefore, in this work we consider only SLICEMs.
Figure 4-2 shows the simplified functional architecture of the Virtex 5's six-input,
two-output LUT. The LUT is logically composed of two five-input, one-output (32x1)
random-access memory structures, addressed by the LUT's lower five inputs (A1:A5).
A mux uses the sixth input (A6) to select which of the two 32x1 outputs drives the LUT's
primary O6 output, while one of them directly drives a secondary output, O5. Output O6 may
be any combinational function of all six inputs, while O5 is a subset function of only five
inputs. Each Virtex 5 LUT can be configured as either a 64x1 or 32x2 dual-ported RAM,
with one synchronous write and one asynchronous read port. An additional six inputs
(WA[1:6]) specify the write address. Each LUT can also be configured as one 32-bit
shift register, or two 16-bit shift registers, each with addressable outputs that can select
any bit of the shift register. Xilinx refers to these shift register primitives as SRL32 and
SRL16 respectively. Figure 4-3 shows a simplified view of a SLICEM from the Virtex 5
user guide [2]. Each SLICEM contains four LUTs, referred to as A, B, C, and D. Each
LUT has six dedicated logic or read address inputs, as well as two data inputs to drive
LUT RAM and SRL inputs. LUT D's read-address inputs (D1:D6) are also used to drive
the six write address inputs for all four LUTs. As discussed later, this addressing method
is a limitation for pseudo-constant logic because LUT D cannot be efficiently used for
outputs.
Paired with each LUT is dedicated carry-chain logic and a flip-flop. Dedicated
outputs carry each LUT O6 output and flip-flop output to the routing fabric. Muxes select
from the LUT O5 output, carry-chain output, and shift-out as well to drive the flip-flop
input and a third output. The shift-out port of each LUT is connected to the shift-in port
of the next LUT (i.e., Shift Out D ? Shift In C) to create longer shift register chains, up
to 128 bits per SLICEM. Two dedicated muxes select between outputs from LUT A and
B, and LUT C and D, and a third mux selects between those two muxes. This structure
enables eight-input muxes using only two LUTs, and 16-input muxes using four LUTs.
4.2.1.1 Distributed RAM
To implement the LUT RAM pseudo-constant primitive, we use Xilinx Distributed
RAM. Each Xilinx LUT allows read and write access to the 64 SRAM bits in either
Figure 4-3. All four LUTs (A-D) of a single Xilinx Virtex 5 SLICEM configured as distributed RAM. LUTs A, B, and C can be configured as three reconfigurable 6:1 or 5:2 functions, while LUT D is consumed by the write-address inputs.
64x1-bit or 32x2-bit dimensions. Multiple LUTs per slice can be grouped together to
create wider or deeper memories. Because the write addresses for the four LUTs
are driven by LUT D's six logic and read inputs, significant limitations are placed
on the efficiency of LUT RAM structures when using the Virtex 5. For example, a
dual-ported 64x1 RAM requires two LUTs, resulting in a 50% area penalty. To achieve
maximum area efficiency, a LUT RAM primitive using Virtex 5 distributed RAM should
ideally use all four LUTs in a single SLICEM. Inputs D[1:6] drive the common write
address and are used to configure LUTs A, B, and C, which can then be used as three
independent LUTs, while LUT D's inputs are consumed by serving as the write address
for LUTs A, B, and C. Figure 4-3 shows four LUTs connected in this fashion. Using LUT
RAM, each SLICEM yields either three 6-input, 1-output functions, or three 5-input,
2-output functions. Because only one flip-flop is available to each LUT, in the case of
2-output functions, only one output can use a flip-flop, which can be a limitation for
pipelined logic. If inputs D[1:6] can be driven by both logic during normal operation
and configuration hardware during reconfiguration, then LUT D may also be used
for pseudo-constant based logic, eliminating the penalty. In this case, four 6:1 or 5:2
functions could be realized per SLICEM.
4.2.1.2 Shift Register
LUT shift-register primitives can be implemented using Xilinx SRL primitives.
When configuring LUTs as shift registers, configuration bits for many LUTs can be
shifted serially in a single configuration chain. Using the SRL32, a single LUT can be
configured as a five-input, one-output function. Configured as two SRL16s, each LUT
can be configured as a four-input, two-output function. Unlike SRL32, each SRL16 must
be driven by an independent configuration input; multiple SRL16 primitives cannot be
chained together in a single, long configuration chain.
4.2.2 Architectural Extensions
The pseudo-constant primitives for the Virtex 5, described above, highlight many
of the challenges of implementing pseudo-constant logic optimization on modern
FPGAs. In this section, we discuss the FPGA architectural characteristics that most
limit the effectiveness of pseudo-constant logic optimizations, particularly those of the
Xilinx Virtex 5 CLB architecture, and suggest modifications to improve the efficiency of
pseudo-constants. Pseudo-constant implementations can be viewed as a traditional
input/output-bound problem. The Virtex 5 implementations described above show
that the number of inputs and outputs to an FPGAs LUTs are a key limitation of
pseudo-constant logic packing and place an upper bound on the achievable area
reduction. Additionally, the number of inputs required to produce a given output value,
and the number of inputs shared among multiple outputs, enforce a similar limitation.
For example, in the design of an adder circuit described in the next section, the key
design limitation was the number of outputs from a LUT. While groups of four to six
input pins, producing three to five sum outputs and a carry, could drive a single LUT, at
most two outputs per LUT could be generated. Additionally, the availability of only one
set of fast carry logic and one flip-flop per LUT limits the achievable maximum clock speed
when using two outputs per LUT. Furthermore, with LUT RAM primitives, one LUT per
SLICE is consumed solely by the use of its address pins for the RAM write address,
and cannot be used for logic. As a result of these challenges, we suggest that future
FPGA architectures could be augmented to improve the viability of pseudo-constant
optimizations. For example, we have observed that modifications to improve efficiency
of wider-output functions, such as those found in many arithmetic operations, could
greatly benefit pseudo-constant optimizations. Particularly, more outputs per LUT
and a fast carry logic and flip-flop pair for each LUT output, could greatly improve the
efficiency of wide- or multi-output functions. By adding an extra set of address pins to
the SLICE to serve as the common write address input, the 25 percent loss of functional
density in LUT RAM based designs can be averted. Figure 4-4 shows a possible SLICE
architecture for a device including these modifications. Specifically, this device's SLICE
architecture is composed of four six-input, two-output LUTs identical to those of a
Virtex 5. Additionally, carry-logic and flip-flop stages identical to those available to the
O6 output of Virtex 5 LUTs are added to both outputs of each LUT. An additional fifth
set of six input pins is added to serve as a common write-address input for LUT RAM
primitives.
4.3 Experiments
To evaluate pseudo-constant logic optimization, we manually technology mapped
common logic functions onto pseudo-constant primitives for Xilinx Virtex 5 FPGAs.
Because Virtex 5, Virtex 6, and Virtex 7 devices all employ an identical CLB architecture,
the results also apply to those devices. To determine benefits, we also synthesize each
circuit without the proposed optimization to a Xilinx Virtex 5 LX50 FPGA using Xilinx ISE
14.2. For each example, we include results for only the most efficient of either a LUT
RAM or shift-register-based design. In addition to the Virtex 5, we evaluate the same
Figure 4-4. A modified Virtex 5 slice including enhancements to improve efficiency of pseudo-constant optimized logic. An extra carry-logic and flip-flop output stage is added to the secondary O5 output of each LUT, and a fifth set of address pins is added for dedicated write-address use.
circuits on a theoretical device incorporating the modifications proposed in Section 4.2.2.
This theoretical device is composed of CLBs using the modified Virtex 5 architecture
shown in Figure 4-4, for which we assume Virtex 5 timing and switching characteristics
[1]. Note that the timing results are optimistic because the theoretical architecture
may have longer delays. Similarly, the theoretical device may have general-purpose
area tradeoffs, which are outside the scope of this study. We also evaluated support
logic to reconfigure pseudo-constant circuits. We implemented a simple programmer
to iteratively shift each bit of the pseudo-constant bitfile into the configuration bits of
each LUT. This circuit is composed of a counter and a BRAM to store the bitfile, and
consumes as few as 10 LUTs. Therefore, pseudo-constant optimizations are only
beneficial when they save more than 10 LUTs. This circuit can be used to program each
LUT sequentially, or replicated to program two or more in parallel.
In this section, we evaluate logic that is commonly replicated in large numbers by
many FPGA applications. These replicated circuits represent an appropriate usage
case for pseudo-constant circuits as many copies can share a small number of support
circuits for reconfiguration and invalidation. This sharing enables the overhead of the
pseudo-constant support logic to be amortized over a large number of optimized circuits.
The evaluated circuits include an adder, a comparator, and a multiplexer.
4.3.0.1 32-bit Full Adder
When synthesized into FPGA LUTs, adder circuits are output-bound. Because
addition operations are wide-output functions, the key challenge in minimizing the
number of LUTs is in driving all N outputs. Synthesis in Xilinx ISE for a Virtex 5 adder
uses the dedicated fast carry logic to create ripple-carry adders. Each LUT adds the ith
bit of each input A and B, generating a sum and carry output. These outputs drive the
carry logic, which combines these signals with Ci-1 to generate Si and Ci. If instead
the add operation had one pseudo-constant input and one normal (i.e., non-constant)
input, the pseudo-constant value can be folded into the function implemented by each
LUT to reduce the LUT utilization. In this case, the only input to each LUT would be the
ith bit of the non-constant input. Even though each LUT has several free inputs, because
both sum and carry-out signals must be generated for each bit, both LUT outputs are
consumed, and no LUTs can be eliminated from the circuit. Suppose instead three bits
of the non-constant input, [Ai, ..., Ai-2], along with a carry input Ci-3, were connected to
two LUTs. The four available outputs from this structure can then implement outputs [Si,
..., Si-2] and Ci. Because each previous bit's inputs are available to the function generator
for each output bit, the internal carry values can be calculated without consuming LUT
outputs. This structure implements a 3-bit full adder using only two LUTs, rather than
three, providing a 33% area savings.
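The folding described above can be sketched by enumerating the truth tables for one two-LUT group (the bit ordering and table encoding here are assumptions for illustration, not the actual LUT INIT format): the inputs are the three non-constant bits A[2:0] plus a carry-in, and the outputs are S[2:0] and a carry-out, with the pseudo-constant operand B baked into every table entry.

```python
# Sketch of folding a pseudo-constant 3-bit operand B into the four truth
# tables produced by a two-LUT group (assumed bit ordering, for illustration).

def three_bit_adder_tables(b_const):
    tables = {"s0": 0, "s1": 0, "s2": 0, "cout": 0}
    for addr in range(16):                  # addr = {cin, A[2:0]}
        a, cin = addr & 0b111, (addr >> 3) & 1
        total = a + b_const + cin           # B is folded in as a constant
        for i in range(3):
            tables[f"s{i}"] |= ((total >> i) & 1) << addr
        tables["cout"] |= ((total >> 3) & 1) << addr
    return tables

# With B = 5, check one entry: A = 6, cin = 1 -> 6 + 5 + 1 = 12 = 0b1100.
t = three_bit_adder_tables(5)
addr = (1 << 3) | 6
assert [(t[k] >> addr) & 1 for k in ("s0", "s1", "s2", "cout")] == [0, 0, 1, 1]
```

On an invalidation of B, only these four small tables need to be regenerated and shifted into the LUTs.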
Using the SRL16-based four-input, two-output pseudo-constant
LUT primitive described above, many such pseudo-constant 3-bit full adders can be
chained together to implement wider pseudo-constant adders. Figure 4-5 shows a
32-bit adder designed using these structures. When synthesized for a Virtex 5 device
Figure 4-5. An SRL16-based pseudo-constant 32-bit full adder design.
using Xilinx XST, a normal 32-bit adder consumes 32 LUTs. When synthesized using
the pseudo-constant based design, a 32-bit adder consumes only 22 LUTs, an area
savings of 31%. Figure 4-6 shows how adder LUT count grows with input width for
both the traditional and pseudo-constant adders. The darker line shows how traditional
adder LUT count grows linearly, equal to the bit width. The lighter line shows how
the pseudo-constant adder LUT count grows slower and in a step-wise fashion. The
step-wise behavior and LUT savings are due to the fact that every other LUT generates
two bits of the output. The figure also shows that the LUT savings increase as adder width
increases.
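The growth trends in Figure 4-6 can be summarized with simple ceiling formulas (a sketch inferred from the structures described above: one LUT per bit for the traditional ripple-carry adder, and two LUTs per three output bits for the pseudo-constant design).

```python
# LUT-count trends as formulas: a traditional Virtex 5 ripple-carry adder uses
# one LUT per bit, while the pseudo-constant design produces three sum bits per
# pair of LUTs, which yields the step-wise growth in Figure 4-6.
from math import ceil

def traditional_adder_luts(width):
    return width

def pseudo_constant_adder_luts(width):
    return ceil(2 * width / 3)      # two LUTs per three output bits

for w in (8, 16, 32):
    print(w, traditional_adder_luts(w), pseudo_constant_adder_luts(w))
# The 32-bit case gives 32 vs. 22 LUTs, the 31% savings reported above.
```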
Because the Virtex 5 CLB's fast carry logic is accessible by only one output from
each LUT, the optimized design cannot benefit from the fast carry logic. Despite a
shorter overall combinational path, 11 logic stages rather than 32, the longer path
between neighboring LUTs increases the circuit's combinational delay by over 4x, from 2.515
Figure 4-6. A graph of LUT counts for pseudo-constant and traditionally synthesized adders on a Xilinx Virtex 5 as adder bit width increases.
ns for traditional logic to 10.377 ns using the pseudo-constant design. Additionally,
because only one flip-flop is available per LUT, only one output from each LUT can
directly drive a pipeline register without consuming an additional route-through LUT.
When the pseudo-constant design is instead mapped onto the modified architecture
from Section 4.2.2, the 31% area savings are retained, while at the same time each
output bit can take advantage of the fast carry logic and flip-flop output stage. Thus, a
32-bit ripple carry adder can be mapped to the modified architecture using 22 LUTs with
a combinational delay of 1.343 ns. This delay for the pseudo-constant-optimized adder
is 47% faster than a traditionally synthesized adder.
4.3.0.2 Multiplexer
A pseudo-constant multiplexer can be designed similarly to the adder described
in the previous section. Using traditional synthesis methods, a four-input mux requires
one LUT on a Virtex 5. Multiple four-input muxes can be combined using dedicated
SLICE mux hardware to create up to one 16-input mux per SLICE. If the select input
to a mux were found to be pseudo-constant, using the SRL32 five-input, one-output
Figure 4-7. A graph of LUT counts for pseudo-constant and traditionally synthesized multiplexers on a Xilinx Virtex 5 as the number of inputs increases.
LUT primitive, a five-input mux consumes only one LUT, and a 20-input mux can be
created in each SLICE. This design yields a 25 percent increase in functional density
over traditional synthesis. Additionally, a four-input, two-output mux can be designed
using the SRL16 four-input, two-output LUT primitive consuming only one LUT, yielding
up to 50 percent LUT savings. By taking advantage of the LUT RAM-based primitive in
the modified architectures, a six-input, one-output mux can be created using just one
LUT, with up to a 24-input mux per SLICE. This design yields a 50 percent increase
in functional density over traditional synthesis, and a 25 percent increase over the
pseudo-constant design on the Virtex 5. Figure 4-7 shows the LUT count needed for
muxes implemented with each design as the number of inputs grows, up to 32 inputs.
The figure shows a step-wise trend due to LUT counts growing in different multiples.
The unoptimized mux increases in multiples of four inputs per LUT. The pseudo-constant
mux on the Virtex 5 increases in multiples of five inputs per LUT. The mux on the modified
architectures increases in multiples of 6 inputs per LUT. As muxes grow larger, the LUT
savings achieved by pseudo-constant designs increase. There is no difference in timing
performance among the three designs.
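These input-per-LUT multiples can be summarized as ceiling formulas (a sketch; the dedicated SLICE muxes that combine LUT outputs are treated as free, matching the per-SLICE capacities stated above).

```python
# Mux LUT counts as a function of input count for the three designs discussed
# above: 4 inputs per LUT (unoptimized), 5 (pseudo-constant SRL32 on Virtex 5),
# and 6 (pseudo-constant LUT RAM on the modified architecture).
from math import ceil

def mux_luts(num_inputs, inputs_per_lut):
    return ceil(num_inputs / inputs_per_lut)

for n in (16, 20, 24, 32):
    print(n,
          mux_luts(n, 4),   # unoptimized Virtex 5 mux
          mux_luts(n, 5),   # pseudo-constant SRL32 design
          mux_luts(n, 6))   # pseudo-constant on the modified architecture
```

For example, a 20-input mux needs four LUTs in the pseudo-constant SRL32 design, the full capacity of one SLICE.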
Figure 4-8. An SRL32-based pseudo-constant 32-bit comparator design.
4.3.1 32-bit Comparator
A pseudo-constant comparator can be designed similarly to the adder described
above. Suppose a circuit must compare two 32-bit numbers, A and B, for equivalence.
When synthesized to the Virtex 5 architecture, this circuit requires 11 LUTs, with a
propagation delay of 4.658 ns. If input B was found to be pseudo-constant, its value
can be folded into the function implemented by the circuit's LUTs. Figure 4-8 shows
a comparator design using the SRL32-based five-input, one-output LUT primitive
described above. The inputs to each LUT are comprised of a group of four consecutive
bits of the variable input, along with a carry-out from the previous group. The outputs
from these groups are cascaded together to create a 32-bit wide comparator using only
8 LUTs instead of 11, an area savings of 27%. The propagation delay increases to 6.556
ns. By taking advantage of the modified architecture, using the six-input, one-output LUT
RAM primitive, the pseudo-constant comparator circuit can be optimized further. The
size of each group of inputs is increased from four to five. Thus a 32-bit pseudo-constant
comparator can be synthesized on the modified architecture using only 6 LUTs, yielding
a 20 percent area decrease and a 20 percent shorter combinational delay compared
to the pseudo-constant design synthesized on the Virtex 5. The resulting delay is only
2.7% slower than the traditional comparator.
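The cascaded comparison can be modeled behaviorally as follows (a software sketch of the logic, not the thesis's HDL; the function name and the `group` parameter are illustrative assumptions):

```python
def pseudo_constant_equal(a: int, b_const: int,
                          width: int = 32, group: int = 4) -> bool:
    """Model of the cascaded pseudo-constant equality comparator: each
    'LUT' checks one group of bits of the variable input A against the
    corresponding bits of the folded constant B, ANDing in the match
    result (carry) cascaded from the previous group."""
    mask = (1 << group) - 1
    carry = True
    for shift in range(0, width, group):
        carry = carry and ((a >> shift) & mask) == ((b_const >> shift) & mask)
    return carry
```

Widening `group` from four to five bits (plus the cascaded carry) models the move from the Virtex 5 five-input primitive to the modified architecture's six-input LUT RAM primitive.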
4.3.2 Functional Density
In [46], Wirthlin et al. present a functional density metric, D, defined as the inverse
of the product of a circuit's area, A, and operating time, T, as shown in Equation 4–1:
D = 1 / (A · T)    (4–1)
This metric is used to quantify the benefits of circuit specialization and enable
comparison of area and performance tradeoffs. Additionally, [46] presents a specialized
form of Equation 4–1 for use with run-time reconfigurable circuits such as pseudo-constant
optimized circuits. By adding reconfiguration time, tconfig, divided by operations per
reconfiguration, n, to the operating time term, the metric accounts for the performance
effects of reconfiguration operations at a given invalidation frequency. Equation 4–2
shows this modified metric.
D = 1 / (A (texec + tconfig / n))    (4–2)
Figure 4-9 plots the functional density, as defined by Equation 4–2, for each of the
three adder circuits. The figure shows the operations between invalidations (i.e., the
inverse of invalidation frequency) decreasing logarithmically. This figure shows that
while the combinational delay overhead on the Virtex 5 architecture prevents the
pseudo-constant circuit from matching the functional density of the traditional adder
circuit, on the modified architecture the pseudo-constant circuit surpasses the functional
density of the traditional adder after only 19 operations between reconfigurations.
Additionally, reconfiguration overhead per operation reaches nearly zero after only
2^14 operations, a small figure considering FPGA clock frequencies in the hundreds of
megahertz. For infrequent invalidations, the functional density of the pseudo-constant
adder on the modified architecture approaches 2.7x that of the traditional adder. In any pseudo-constant design
using LUT RAM or shift-register LUTs, reconfiguration can load the pseudo-constant
bitfile into each LUT either in serial or in parallel. In serial reconfiguration, the bits are
written into one LUT at a time, one bit per cycle. This method yields
the longest reconfiguration time and the largest performance penalty. Alternatively,
during parallel reconfiguration, each bit must still be written into each LUT one bit
at a time, but all LUTs can be written on the same cycle. Parallel reconfiguration
decreases reconfiguration time by a factor of N, where N is the total number of LUTs in
the pseudo-constant circuit. Because parallel reconfiguration requires proportionally
more reconfiguration resources, a designer or synthesis tool must consider an
area-performance tradeoff between parallel and serial reconfiguration. Additionally,
the degree of parallelism can be adjusted to find an appropriate Pareto-optimal design
point for each design.
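The serial/parallel tradeoff can be expressed as a small cost model (the function and its parameters are assumptions for illustration, not the thesis's tooling):

```python
import math

def reconfig_cycles(num_luts: int, bits_per_lut: int, parallel_luts: int) -> int:
    """Cycles to reload all pseudo-constant LUTs: one bit enters a LUT per
    cycle, and `parallel_luts` LUTs are written concurrently.
    parallel_luts=1 is fully serial; parallel_luts=num_luts is fully parallel."""
    return math.ceil(num_luts / parallel_luts) * bits_per_lut
```

For example, reloading eight 32-bit SRL-based LUTs takes 256 cycles serially but only 32 cycles fully in parallel, at the cost of roughly eight parallel write paths; intermediate values of `parallel_luts` trace the area-performance curve between those extremes.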
Figure 4-10 compares the functional density of each pseudo-constant 32-input
mux to traditional muxes using either fully-parallel or fully-serial reconfiguration. Longer
dashed lines show parallel reconfiguration, and dotted lines show serial reconfiguration.
Lighter lines show pseudo-constant muxes implemented on the standard Virtex 5
architecture. Darker lines show pseudo-constant designs implemented on the modified
architecture. Functional density for intermediate degrees of parallelism can be inferred
between the trend lines for each architecture. All densities are shown as a ratio to
functional density of a traditional Virtex 5 mux, shown in solid black.
The results show that pseudo-constant muxes approach a functional density
of 1.25x on the Virtex 5 architecture, and 1.5x on the modified architecture, when
compared to traditional synthesis. Additionally, the graph shows that the break-even
point, at which functional density of the pseudo-constant optimized and traditional
circuits are equal, is approximately 128 operations per invalidation using fully parallel
reconfiguration, and fewer than 900 operations using fully serial reconfiguration.
Figure 4-9. Functional density of a pseudo-constant adder compared to a traditional adder as the invalidation frequency increases. Results are shown for both the Virtex 5 and modified architectures.
Figure 4-10. Functional density of each pseudo-constant mux design compared to a traditional mux as the invalidation frequency increases. Functional density for each design is shown for both fully-parallel and fully-serial reconfiguration.
CHAPTER 5
CONCLUSIONS
Previous work introduced intermediate fabrics to address FPGA problems related
to lengthy place-and-route times and a lack of application portability. Although previous
intermediate fabric approaches achieve both application portability and significant
place-and-route speedup, the area overhead of those approaches prohibits important
use cases. To address this problem, we identified the virtual interconnect as the main
source of the overhead, and followed two complementary approaches to reduce
overhead.
After identifying multiplexers as the primary component of the interconnect, we first
performed design-space exploration to identify unconventional alternatives that could
achieve effective Pareto-optimal tradeoffs between overhead and routability. Based on
this analysis, we introduced an optimized virtual interconnect architecture that reduces
area requirements by approximately 50% and improves clock frequencies by 24%, with
a modest 16% reduction in routability.
Additionally, we sought to reduce the size of overhead due to each individual
multiplexer through pseudo-constant logic optimization. We showed that pseudo-constant
optimizations can increase functional density of common logic structures such as
multiplexers up to 1.25x. While these optimizations can apply to many other functional
elements, such as adders and comparators, the experiments also show the difficulty of
implementing pseudo-constant designs on modern FPGAs. In particular, restrictions
on dynamic reconfigurability and narrow-output functional units limit the effectiveness
of pseudo-constant optimizations. If future FPGA designs address these concerns,
pseudo-constant optimizations could be a viable method of increasing functional density
in FPGA designs, with improvements as high as 2.7x.
While these optimizations enable designers to employ intermediate fabrics in a
wider range of area-constrained applications, there is still opportunity for continued
improvement. Future work must address and limit the routability and flexibility penalty of
the optimized interconnect presented, as well as both the manual and automated design
challenges of integrating pseudo-constant logic optimizations. Even with a 50-75%
reduction in LUT utilization, intermediate fabrics will still have prohibitive overhead
for use cases where an FPGA is close to being fully utilized. Fortunately, the trends
towards multi-million-LUT FPGAs will lessen this problem over time. In addition, we plan
to investigate virtual interconnect that directly targets the physical FPGA interconnect
without using muxes. Such an approach could map virtual switch boxes directly onto
physical switch boxes, potentially eliminating much of the remaining overhead. However,
such an approach requires knowledge of proprietary routing architectures, and is
therefore deferred to future work.
REFERENCES
[1] Xilinx Virtex-5 FPGA Data Sheet: DC and Switching Characteristics, 2010.
[3] Ashenhurst, R. L. “The decomposition of switching functions.” Proc. Internatl. Symp. Theory of Switching, Annals Computation Lab. vol. 29. Cambridge, Mass: Harvard University, 1957, 74–116.
[4] Athanas, P., Bowen, J., Dunham, T., Patterson, C., Rice, J., Shelburne, M., Suris, J., Bucciero, M., and Graf, J. “Wires on Demand: Run-Time Communication Synthesis for Reconfigurable Computing.” FPL ’07: International Conference on Field Programmable Logic and Applications. 2007, 513–516.
[5] Becker, J., Pionteck, T., Habermann, C., and Glesner, M. “Design and implementation of a coarse-grained dynamically reconfigurable hardware architecture.” VLSI ’01: Proceedings of IEEE Computer Society Workshop on VLSI. 2001, 41–46.
[6] Betz, Vaughn and Rose, Jonathan. “VPR: A new packing, placement and routing tool for FPGA research.” FPL ’97: Proceedings of the 7th International Workshop on Field-Programmable Logic and Applications. London, UK: Springer-Verlag, 1997, 213–222.
[7] Brant, A. and Lemieux, G.G.F. “ZUMA: An Open FPGA Overlay Architecture.” Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on. 2012, 93–96.
[8] Callahan, Timothy J., Chong, Philip, DeHon, Andre, and Wawrzynek, John. “Fast module mapping and placement for datapaths in FPGAs.” FPGA ’98: Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field programmable gate arrays. New York, NY, USA: ACM, 1998, 123–132.
[9] Chen, Chau-Shen, Tsay, Yu-Wen, Hwang, TingTing, Wu, A.C.H., and Lin, Youn-Long. “Combining technology mapping and placement for delay-minimization in FPGA designs.” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 14 (1995).9: 1076–1084.
[10] Compton, Katherine and Hauck, Scott. “Totem: Custom Reconfigurable Array Generation.” FCCM ’01: Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. Washington, DC, USA: IEEE Computer Society, 2001, 111–119.
[11] Coole, James and Stitt, Greg. “Intermediate Fabrics: Virtual Architectures for Circuit Portability and Fast Placement and Routing.” CODES/ISSS ’10: Proceedings of the IEEE/ACM/IFIP international conference on Hardware/Software codesign and system synthesis. 2010, 13–22.
[12] Cox, C.E. and Blanz, W.E. “GANGLION-a fast field-programmable gate array implementation of a connectionist classifier.” Solid-State Circuits, IEEE Journal of 27 (1992).3: 288–299.
[13] Dehon, Andre Maurice. Reconfigurable architectures for general-purpose computing. Ph.D. thesis, 1996. AAI0597715.
[14] Donthi, S. and Haggard, R.L. “A survey of dynamically reconfigurable FPGA devices.” System Theory, 2003. Proceedings of the 35th Southeastern Symposium on. 2003, 422–426.
[15] Ebeling, Carl, Cronquist, Darren C., and Franklin, Paul. “RaPiD - Reconfigurable Pipelined Datapath.” FPL ’96: Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers. London, UK: Springer-Verlag, 1996, 126–135.
[16] Eguro, Ken and Hauck, Scott. “Armada: timing-driven pipeline-aware routing for FPGAs.” FPGA ’06: Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays. New York, NY, USA: ACM, 2006, 169–178.
[17] Eldredge, J.G. and Hutchings, B.L. “Density enhancement of a neural network using FPGAs and run-time reconfiguration.” FPGAs for Custom Computing Machines, 1994. Proceedings. IEEE Workshop on. 1994, 180–188.
[18] Farrahi, A.H. and Sarrafzadeh, M. “Complexity of the lookup-table minimization problem for FPGA technology mapping.” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 13 (1994).11: 1319–1332.
[19] Foulk, P.W. “Data-folding in SRAM configurable FPGAs.” FPGAs for Custom Computing Machines, 1993. Proceedings. IEEE Workshop on. 1993, 163–171.
[20] Giri, A., Visvanathan, V., Nandy, S.K., and Ghoshal, S.K. “High speed digital filtering on SRAM-based FPGAs.” VLSI Design, 1994., Proceedings of the Seventh International Conference on. 1994, 229–232.
[21] Goslin, G. “Using Xilinx FPGAs to design custom digital signal processing devices.” 1995, 565–604.
[22] Grant, David, Wang, Chris, and Lemieux, Guy G.F. “A CAD framework for Malibu: an FPGA with time-multiplexed coarse-grained elements.” Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays. FPGA ’11. New York, NY, USA: ACM, 2011, 123–132.
[23] Gunther, B., Milne, G., and Narasimhan, L. “Assessing document relevance with run-time reconfigurable machines.” FPGAs for Custom Computing Machines, 1996. Proceedings. IEEE Symposium on. 1996, 10–17.
[24] Hammerquist, M. and Lysecky, R. “Design space exploration for application specific FPGAs in system-on-a-chip designs.” SOC ’08: Proceedings of the IEEE International SOC Conference. 2008, 279–282.
[25] Hayes, J.P. “A unified switching theory with applications to VLSI design.” Proceedings of the IEEE 70 (1982).10: 1140–1151.
[26] Kapre, Nachiket, Mehta, Nikil, deLorimier, Michael, Rubin, Raphael, Barnor, Henry, Wilson, Michael J., Wrighton, Michael, and DeHon, Andre. “Packet-Switched vs. Time-Multiplexed FPGA Overlay Networks.” Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines. 2006.
[27] Koch, Andreas. “Structured design implementation: a strategy for implementing regular datapaths on FPGAs.” FPGA ’96: Proceedings of the 1996 ACM fourth international symposium on Field-programmable gate arrays. New York, NY, USA: ACM, 1996, 151–157.
[28] Koehler, S., Stitt, G., and George, A.D. “Performance visualization and exploration for reconfigurable computing applications.” ????
[29] Landy, Aaron and Stitt, Greg. “A low-overhead interconnect architecture for virtual reconfigurable fabrics.” Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems. CASES ’12. New York, NY, USA: ACM, 2012, 111–120.
[30] Lavin, C., Padilla, M., Lamprecht, J., Lundrigan, P., Nelson, B., and Hutchings, B. “HMFlow: Accelerating FPGA Compilation with Hard Macros for Rapid Prototyping.” Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on. 2011, 117–124.
[31] Lemoine, E. and Merceron, D. “Run time reconfiguration of FPGA for scanning genomic databases.” FPGAs for Custom Computing Machines, 1995. Proceedings. IEEE Symposium on. 1995, 90–98.
[32] Lysaght, Patrick, Stockwood, Jon, Law, J., and Girma, D. “Artificial Neural Network Implementation on a Fine-Grained FPGA.” Proceedings of the 4th International Workshop on Field-Programmable Logic and Applications: Field-Programmable Logic, Architectures, Synthesis and Applications. FPL ’94. London, UK: Springer-Verlag, 1994, 421–432.
[33] Lysecky, Roman, Miller, Kris, Vahid, Frank, and Vissers, Kees. “Firm-core Virtual FPGA for Just-in-Time FPGA Compilation (abstract only).” Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays. FPGA ’05. New York, NY, USA: ACM, 2005, 271.
[34] Marshall, Alan, Stansfield, Tony, Kostarnov, Igor, Vuillemin, Jean, and Hutchings, Brad. “A reconfigurable arithmetic array for multimedia applications.” FPGA ’99: Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 1999, 135–143.
[36] McMurchie, Larry and Ebeling, Carl. “PathFinder: a negotiation-based performance-driven router for FPGAs.” FPGA ’95: Proceedings of the 1995 ACM Third International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 1995, 111–117.
[37] Mulpuri, Chandra and Hauck, Scott. “Runtime and quality tradeoffs in FPGA placement and routing.” FPGA ’01: Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 2001, 29–36.
[38] Murgai, R., Nishizaki, Y., Shenoy, N., Brayton, R.K., and Sangiovanni-Vincentelli, A. “Logic synthesis for programmable gate arrays.” Design Automation Conference, 1990. Proceedings., 27th ACM/IEEE. 1990, 620–625.
[39] Roth, J. Paul and Karp, R. M. “Minimization Over Boolean Graphs.” IBM Journal of Research and Development 6 (1962).2: 227–238.
[40] Sekanina, Lukas. Evolvable Systems: From Biology to Hardware, chap. Virtual Reconfigurable Circuits for Real-World Applications of Evolvable Hardware. Springer Berlin / Heidelberg, 2003, 116–137.
[41] Shukla, Sunil, Bergmann, Neil W., and Becker, Jurgen. “QUKU: A Two-Level Reconfigurable Architecture.” ISVLSI ’06: Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures. Washington, DC, USA: IEEE Computer Society, 2006, 109.
[42] Stitt, G. and Coole, J. “Intermediate Fabrics: Virtual Architectures for Near-Instant FPGA Compilation.” Embedded Systems Letters, IEEE 3 (2011).3: 81–84.
[43] Tsu, William, Macy, Kip, Joshi, Atul, Huang, Randy, Walker, Norman, Tung, Tony, Rowhani, Omid, George, Varghese, Wawrzynek, John, and DeHon, Andre. “HSRA: high-speed, hierarchical synchronous reconfigurable array.” FPGA ’99: Proceedings of the 1999 ACM/SIGDA seventh international symposium on Field programmable gate arrays. New York, NY, USA: ACM, 1999, 125–134.
[44] Villasenor, J., Schoner, B., Chia, Kang-Ngee, Zapata, C., Kim, Hea Joung, Jones, C., Lansing, S., and Mangione-Smith, B. “Configurable computing solutions for automatic target recognition.” FPGAs for Custom Computing Machines, 1996. Proceedings. IEEE Symposium on. 1996, 70–79.
[45] Wang, J., Chen, Q.S., and Lee, C.H. “Design and implementation of a virtual reconfigurable architecture for different applications of intrinsic evolvable hardware.” Computers & Digital Techniques, IET 2 (2008).5: 386–400.
[46] Wirthlin, Michael J. and Hutchings, Brad L. “Improving Functional Density Through Run-Time Constant Propagation.” In ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 1997, 86–92.
[47] Yiannacouras, Peter, Steffan, J. Gregory, and Rose, Jonathan. “VESPA: portable, scalable, and flexible FPGA-based vector processors.” Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems. CASES ’08. New York, NY, USA: ACM, 2008, 61–70.
BIOGRAPHICAL SKETCH
Aaron Landy received the Bachelor of Science in Electrical Engineering degree
from the University of Texas at Austin in 2011, with specialization in Computer
Architecture and Embedded Systems. While at the University of Texas, he worked
under Dr. Derek Chiou to implement a low-overhead in-situ debugging framework for
FPGA applications.
In 2011, he worked in Post-Silicon Validation for the Atom System-on-Chip at
Intel Corporation in Austin, Texas. He joined the NSF Center for High Performance
Reconfigurable Computing (CHREC) at the University of Florida as a Ph.D. student
and research assistant under Dr. Greg Stitt. Aaron received the Master of Science in
Electrical Engineering degree from the University of Florida in 2013.
His research interests include reconfigurable computing, computer architecture,
and embedded systems. His current work focuses on FPGA toolflows and productivity,
particularly fast place-and-route, high-level synthesis, and FPGA virtualization.