Cellular Automata for Structural Optimization on
Reconfigurable
Computers
Thomas R. Hartka
Thesis submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Engineering
Dr. Mark T. Jones, Chair
Dr. Peter M. Athanas
Dr. Michael S. Hsiao
May 12, 2004
Blacksburg, Virginia
Keywords: configurable computing, cellular automata, design
optimization
Copyright © 2004, Thomas R. Hartka
Cellular Automata for Structural Optimization on
Reconfigurable
Computers
Thomas R. Hartka
(ABSTRACT)
Structural analysis and design optimization is important to a
wide variety of disciplines. The
current methods for these tasks require significant time and
computing resources. Reconfig-
urable computers have shown the ability to speed up many
applications, but are unable to
handle efficiently the precision requirements for traditional
analysis and optimization tech-
niques. Cellular automata theory provides a method to model
these problems in a format
conducive to representation on a reconfigurable computer. The
calculations do not need to be
executed with high precision and can be performed in parallel.
By implementing cellular au-
tomata simulations on a reconfigurable computer, structural
analysis and design optimization
can be performed significantly faster than conventional
methods.
This work was partially supported by NSF grant #9908057 as well
as by the Virginia
Tech Aspires program.
Acknowledgements
I would first like to thank my advisor, Dr. Mark Jones, for his guidance throughout my research. Without it I never would have been able to complete this thesis.
Thanks to Dr. Athanas for serving on my thesis committee and for
making development
on the DINI board possible.
Thanks to Dr. Hsiao for serving on my committee and being an
excellent teacher.
Thanks to Dr. Gurdal and his researchers for providing the mathematics for the cellular automata models and for all the effort they spent learning about reconfigurable computing so that the equations mapped efficiently.
Thanks to all the professors and students involved with the
Configurable Computing Lab
for making it a great place to work.
Thanks to all the people that helped in the process of reviewing
and editing this thesis.
I am forever indebted to anyone who will review sixty pages of
my writing.
Thanks to everyone else who I have not mentioned that helped
with my work. I could
not have done it without the support from the people around
me.
Contents

1 Introduction
  1.1 Thesis Statement
  1.2 Thesis Organization
2 Background
  2.1 Cellular Automata
  2.2 Configurable Computers for Scientific Computations
  2.3 Limited Precision
3 System Design
  3.1 Design Background
    3.1.1 System Hardware
    3.1.2 Distributed Layout
  3.2 Problem Specific Design
  3.3 Program Based Design
4 Results
  4.1 Problem Formulation
  4.2 Problem Specific Design Results
  4.3 Program Based Design Results
  4.4 Comparison of Designs
5 Conclusions
  5.1 Summary
  5.2 Results
  5.3 Future Work
Appendix A
Vita
List of Figures

3.1 Setup of the configurable computer used for simulating the CA model.
3.2 Distribution of logical CA cells among PEs.
3.3 Arithmetic unit for the Problem Specific design.
3.4 PE layout for the Problem Specific design.
3.5 Return data chains for the Program Based design.
3.6 Control unit for the Program Based design.
3.7 Analysis cycle flow and precision for each operation.
3.8 Computational unit for the Program Based design.
3.9 Multiply accumulator used in the computational unit.
3.10 MSB data return chain, used for determining the most significant ‘1’ of residuals.
3.11 Unit for shifting the precision of intermediate results.
3.12 Matrix accumulator used for analysis updates.
3.13 Data flow for uploading and downloading data to the FPGAs.
4.1 Diagram of the CA model for performing analysis on a beam.
4.2 Beam analysis problem modeled on the configurable computer.
4.3 Precision of PE vs. % utilization of FPGA for the Problem Specific design.
4.4 % utilization of FPGA and maximum clock frequency vs. number of PEs for the Problem Specific design.
4.5 Cell updates per second vs. number of PEs for the Problem Specific design.
4.6 Actual results and results from the Problem Specific design for the beam problem.
4.7 Precision of PE vs. % utilization of FPGA for the Program Based design.
4.8 Efficiency vs. number of inner iterations per analysis cycle.
4.9 % utilization of FPGA and maximum clock frequency vs. number of PEs for the Program Based design.
4.10 Cell updates per second vs. number of PEs for the Program Based design.
4.11 Actual results and results from the Program Based design for the beam problem.
A.1 Spreadsheet with positions of control signals and short descriptions.
A.2 Spreadsheet containing the update program.
A.3 Spreadsheet converting signals to the form used by the Program Based model.
A.4 Spreadsheet containing the data values in a form that can be loaded into memory.
List of Tables

4.1 Times for operations associated with the Problem Specific analysis cycle on the DINI board.
4.2 Clock cycles for different phases of the residual-update analysis cycle.
4.3 Times for operations associated with analysis on the Program Based design.
4.4 Maximum cell updates per second for both implementations.
4.5 Maximum cell updates per second and speed-up for both implementations compared to a PC.
Chapter 1
Introduction
Structural analysis and design optimization are an integral part of many industries. Applications range from simple tasks, such as testing and optimizing a support beam, to very complicated ones, such as optimizing the structure of a car for crash resistance.
Performing the design iterations manually is very time
consuming. Therefore, a significant
amount of research has been conducted to develop efficient
methods to automate the design
process.
Traditional methods for automating design have involved running
simulations on general
purpose processors. In these methods, calculations must be
performed in high precision.
Parallelization of the calculations, where possible, is done
through expensive supercomputers
with hundreds of processors. Even with massive amounts of
computing power, the simula-
tions will usually take hours to complete.
Cellular Automata (CA) have proved to be a very powerful tool for
modeling physical
phenomena. CA models have successfully captured the behavior of
complex systems such
as fluid flow around a wing and pedestrian traffic [?, ?].
Recently, CA theory has been
extended to structural analysis and design optimization [?].
Using CA in structural models
changes analysis and design optimization into a highly parallelizable form that does not require
high-precision calculations. This provides the potential for
significant speed-up.
1.1 Thesis Statement
Using CA provides a method to efficiently map structural design
optimization problems onto
FPGAs. By exploiting the inherent parallelism of FPGAs there is
the potential for speed-up
over general purpose processors.
To achieve this objective, distributed processing systems were
implemented on a con-
figurable computer. The system consisted of a host PC connected
to a PCI based board
with five FPGAs. Two designs were developed for the FPGAs to
rapidly iterate CA mod-
els for structural analysis. The two designs represent
significantly different approaches to
accomplishing the same objective.
The author’s contributions to this work are the following:
- developed a custom FPGA design for simulating a beam CA
model,
- developed a separate FPGA design that executes programs to
simulate CA models,
- wrote programs for the FPGA design to simulate a beam CA
model, and
- implemented a limited precision method in hardware for solving
iterative improvement
problems.
1.2 Thesis Organization
Chapter 2 presents background information about CA theory,
scientific computations on con-
figurable computers, and the limited precision method used in
the designs. Chapter 3 gives
details on the two implementations developed to solve the CA
models. Chapter 4 presents
the results for each of the two implementations and comparisons
to traditional methods.
Chapter 5 summarizes the work performed for this project and the
results obtained.
Chapter 2
Background
This chapter presents previous work in areas related to this
research. The combined contri-
butions discussed were used in the completion of the simulation
environment and prototype
presented in this thesis.
2.1 Cellular Automata
The concept of Cellular Automata (CA) theory is to
model systems with many
objects that interact [?]. The systems are divided into discrete
units, or cells, that act
autonomously. The advantage of using CA is that the behavior of
some complex systems
can be captured using relatively simple rules for each cell [?].
Attempting to reproduce this
behavior without breaking those systems into autonomous units,
even if possible, would be
complicated.
Each cell in a CA model can be in a single state at any given
point in the simulation. The
number of states the cell may be in depends on the problem being
solved. In many models
the number of elements in the set of states is small (eight or fewer), but there is a new class of
CA models that use a continuous state space. These continuous
state space CA models are
known as coupled map-lattice or cell dynamic schemes. The next
state of a cell is based on
an update rule, sometimes referred to as a transition rule,
which is a function of its current
state and the current state of its neighbors [?]. The collective
state of all of the cells in the
model at any given point is known as the global state [?].
Stanislaw Ulam is generally credited with the first work in CA, originally referred to
originally referred to
as cellular space or automata networks. John von Neumann
extended Ulam’s work and
proposed CA as a way to model self-reproducing biological
systems [?, ?]. The work of
Ulam and von Neumann provides a formal method for simulating
complex systems. Their
research, and much of the current research in CA, focused on
modeling dynamic systems in
which time and space are discrete. Each calculation of the next
state of all the cells in a
system represents a step in time [?]. A good example of this
type of CA model is Conway’s
Game of Life in which cells can be in one of two states: alive
or dead. Each update of the
global state represents a new generation of organisms [?].
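As an illustration of such an update rule (not part of this thesis's designs), the Game of Life step can be written in a few lines. The sketch below treats the grid as a torus, and the function name is chosen for this sketch only.

```python
# Minimal Game of Life step: each cell's next state is a function of its
# current state and the number of live cells in its Moore neighborhood.
def life_step(grid):
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live neighbors; indices wrap, giving a toroidal grid.
            live = sum(grid[(r + dr) % rows][(c + dc) % cols]
                       for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                       if (dr, dc) != (0, 0))
            # A live cell survives with 2 or 3 neighbors; a dead cell
            # becomes live with exactly 3.
            nxt[r][c] = 1 if live == 3 or (grid[r][c] and live == 2) else 0
    return nxt
```

On a 5x5 grid, a horizontal bar of three live cells (a "blinker") alternates between a horizontal and a vertical bar, one generation per call.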
There are a number of architectures for CA models, each
resulting in different behavior.
The number of dimensions of the CA model can differ greatly
depending on the system being
modeled. In practice, models typically have one, two, or three dimensions. However, there
is no limit to the number of dimensions that can be used [?].
The number of dimensions of
the grid has a large effect on the communication network among
cells, known as the cellular
neighborhood.
In the work on two-dimensional grids there are two common
cellular neighborhoods. The
first is the von Neumann neighborhood, in which each cell
communicates only with the four
cells that are orthogonally adjacent to it. The second is a
Moore neighborhood in which a
cell communicates with all eight cells surrounding it [?].
Though von Neumann and Moore
neighborhoods are common, cells are not limited to communicating
only with those that
are adjacent. The “MvonN Neighborhood” uses the nine cells
(including the center cell)
in the Moore neighborhood as well as the four cells orthogonally
one space away from the
current cell. Additionally, the communication of the cells
within a model is not required to
be consistent throughout the model [?].
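The neighborhoods above can be written down as sets of (row, column) offsets. The sketch below is illustrative only; in particular, the "MvonN" offsets assume the four extra cells sit two steps away orthogonally, which is one reading of the description above.

```python
# Neighborhoods expressed as (row, column) offsets from the current cell.
VON_NEUMANN = [(-1, 0), (1, 0), (0, -1), (0, 1)]           # 4 orthogonal cells
MOORE = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
         if (dr, dc) != (0, 0)]                            # 8 surrounding cells
# "MvonN": the nine Moore cells (including the center) plus, under the
# interpretation assumed here, the four cells two steps away orthogonally.
MVONN = MOORE + [(0, 0)] + [(-2, 0), (2, 0), (0, -2), (0, 2)]
```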
There has also been a significant amount of work investigating
non-rectangular cell sys-
tems. Gas lattice automata are a subset of cellular automata
that commonly use the FHP
model. The FHP model uses a hexagonal grid, where cells
communicate with their six im-
mediate neighbors [?,?]. The use of triangular and regular
polygonal lattices is common in
specialized applications of cellular automata because they can
better capture the behavior
of certain systems [?].
Models in which communication and update rules are consistent
throughout the model
are called uniform. Though most of the work in the area of CA
has used uniform models,
the use of non-uniform rules does not necessarily detract from
the effectiveness of using CA.
A number of experiments have been conducted to model the effect
of “damaged” areas of
a grid where cells use different rules [?]. In terms of
simulating a CA model on a serial
processor, a uniform grid has the advantage that only one update
rule is needed [?].
The grid for a CA model may be finite or infinite. In his work,
von Neumann examined
infinite grids as a method to construct a universal computer
[?]. Although von Neumann’s
work on infinite grids was theoretical, methods for representing
and calculating CA models
on infinite grids have been developed [?]. Finite grids are much
simpler to implement and
process in parallel because the maximum size of the active area
is known before processing
begins. However, the use of finite grids introduces the problem
of how to calculate cells on
the edge of the grid, known as the boundary conditions.
There are several ways to handle the processing of cells on the
edge of a finite grid. The
first method is to logically connect the cells on one edge of
the grid to cells on the opposite
edge, producing a loop. Another way to handle boundary
conditions is to use a fixed value
for cells at the perimeter of the grid. In systems with fixed
boundary conditions, the edge
cells are known as dummy cells because they do not need to be
updated. The third method
for calculating the update for edge cells is to use an update
rule that is different from the one used for the internal cells [?]. An example of such a rule would be an edge cell that simply mirrors the value of the closest internal cell. The type of
boundary condition used depends
largely on the problem being modeled.
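These three boundary treatments can be sketched for a one-dimensional grid as follows (illustrative Python; the function and mode names are invented for this sketch):

```python
def neighbor_value(grid, i, mode, fixed=0.0):
    """Value seen at logical index i, which may fall off the edge of grid."""
    n = len(grid)
    if 0 <= i < n:
        return grid[i]          # an ordinary internal access
    if mode == "periodic":      # edges logically connected into a loop
        return grid[i % n]
    if mode == "fixed":         # dummy cells hold a constant value
        return fixed
    if mode == "mirror":        # edge rule: reflect the closest internal cell
        return grid[0] if i < 0 else grid[-1]
    raise ValueError(mode)
```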
The early work in CA theory concentrated on theoretical
computational questions, such
as computational universality. In later work, it has been used
as a method to study social,
physical, and biological systems [?]. A number of studies have
been conducted to use CA
to capture the aggregate behavior of groups of autonomous
beings, for example, car traffic,
pedestrian flow, and ant colonies [?,?,?]. In scientific
computing, many successful attempts
have been made to model such phenomena as fluid dynamics,
chemical absorption, and
heat transfer using CA [?,?,?].
Some of the most recent work in CA has been in the field of
structural analysis and
design. The first work in this area was the development of
methods to optimize the angle
and cross-section area of trusses in a fixed structure [?].
These methods proved successful in
merging analysis and design into a CA model and showed powerful
computational properties.
This success prompted more work to extend CA theory to create
models for other structural
design problems.
A model was developed to minimize the weight of a beam needed to
prevent buckling
[?]. The beam is represented in sections, to which constraints
and external forces can be
applied independently. The cross-sectional area for each section
is determined to produce the
minimum total weight of the beam. Experiments with this method
showed models converged
to the correct minimum solution. This area of CA research shows
substantial promise for
accelerating structural design optimization through parallel
computing.
2.2 Configurable Computers for Scientific Computations
FPGAs, which are usually the basis for configurable computers,
show considerable speed-up
for a variety of applications when compared to general purpose
processors. These applications
include signal processing, pattern searching, and encryption.
The use of FPGAs for these
tasks, which mainly involve bit manipulation, has shown orders
of magnitude acceleration
[?,?,?]. These accelerations are possible because the tasks can
be broken down into simple
operations. The operations can then be performed in parallel
throughout the chip.
Although pattern matching and bit manipulation have been widely
studied on FPGAs,
FPGAs have typically not been used for scientific computations.
Two significant deterrents
in using FPGAs are the limited available programmable logic and
the slow clock speeds. In
the past, FPGAs have only been able to represent circuits that
had gate counts in the low
thousands [?]. This low gate count is restrictive for scientific
computations. For example, a
32-bit parallel multiplier could not be emulated by most of the
FPGAs in the Xilinx XC3000
family, chips that were first produced in the mid 1990s (based
on number of CLBs). This
becomes even more of a handicap because FPGAs typically operate
at clock speeds much
lower than average CPUs. A general purpose processor will
usually outperform an FPGA if
the FPGA cannot carry out parallel or deeply pipelined
operations.
These limitations of FPGAs have been greatly reduced in recent
chips because of the
much larger transistor densities. The latest Xilinx FPGAs can
emulate circuits with up
to 10 million gates [?]. With increased programmable logic, it
is possible to have many
more arithmetic units performing complicated operations in
parallel. In comparison to the
previous example, a Xilinx XC2V8000 chip, currently Xilinx’s
largest FPGA, has enough
logic to represent thirty-five 32-bit multipliers (based on
number of CLBs). Floating-point
operations continue to require a large percentage of available
resources. Still, researchers have
begun exploring scientific computations on FPGAs. A paper from
researchers at Virginia
Tech used the flexibility of FPGAs to develop representations of
floating-point numbers
that are more efficient on FPGAs [?]. In 2002, researchers
published a paper detailing the
development of a limited precision floating-point library and an
optimizer to determine the
minimum precision needed in DSP calculations [?].
Using the least precision possible is important on an FPGA.
General purpose processors
usually compute operations in higher precision than is needed
because of the limited choices
for precision. However, the fine-grain control of the logic in
an FPGA allows custom arith-
metic units of any precision. This flexibility can be extended
to dynamically controlling the
precision of different calculations on the same unit. Other work
has been presented on a
variable precision coprocessor for a configurable computer and
algorithms given for variable
precision arithmetic units [?]. Two papers have been published
investigating how to manage
dynamically varying precision and to show how the overall
runtime is substantially decreased
by using minimal precision [?,?].
CA has been used in the computer science community for some
time. In 1985, a book was
published describing implementations for CA simulations on
massively parallel computers
[?]. However, there has been little work done in trying to run
these models on configurable
computers. There have been some papers written on using CA on
FPGAs, but they all focus
on models with simple cell update rules and small state sets.
For example, the CAREM
system was developed to efficiently model CA on FPGAs [?]. The
two models published as
examples of using the CAREM system were an image thinning
algorithm and a forest fire
simulation. In both cases the models were simple, having state
set sizes of four or fewer. Other
cellular automata simulation systems have been proposed for
reconfigurable computers which
concentrated on fluid dynamics [?,?]. However, like CAREM, these systems are only capable of handling simple models with a very limited number of states.
Custom hardware architectures were implemented by Norman
Margolus from MIT for
processing CA. The most successful was known as the Cellular
Automata Machine 8 (CAM-
8) [?]. The CAM-8 is based on custom SIMD processors that are
connected in three
dimensions. Each processor is responsible for a section of data
in the model which is stored
in a DRAM. Processing on each cell’s data is performed using look-up tables (LUTs) stored in SRAM. This architecture shows impressive results, generating up
to 3 billion cell updates
per second. However, the LUT based processing limits models to a
fairly small state size.
There have been a number of projects which use the CAM-8 in
areas such as modeling fluid
motion [?] and gas lattices [?]. The CAM-8 is now sold
commercially.
2.3 Limited Precision
The use of configurable computers has renewed study in the area
of limited precision com-
puting. Determining the least number of bits needed for a task
was important when many
chips were custom designed and silicon was expensive. With the
rise of cheaper fabrication
methods and inexpensive, powerful CPUs, this area became
less important. The use of
general purpose processors with dedicated floating-point units
lessens the penalty for using
floating-point for all calculations. However, as configurable
computers become popular, the
use of limited precision for calculations has again become
important [?].
All configurable computers are based on programmable logic at
some level of granularity.
Historically, the most popular type of programmable logic is the
FPGA. FPGAs have bit-
level granularity so arithmetic units can be built with any
precision. In most cases, each
additional bit of precision of an arithmetic unit will require
more chip resources. Also, the
maximum clock frequency for an arithmetic unit may decrease with
each additional bit of
precision. This high sensitivity makes using the lowest
precision possible very important to
optimizing a design on an FPGA.
The use of limited precision has been extended further to solving iterative problems on FPGAs in a recent paper [?]. This paper describes a method for performing
low precision calculations
that are collated into high precision results. Similar ideas
were developed for CPUs, but
those studies focused on using single precision floating-point
calculations to find double
precision solutions [?,?]. As mentioned earlier, FPGAs have a
much finer grain of control
over precision, and floating-point calculations are expensive
when using FPGAs. So a new,
modified version of this concept has recently been investigated
specifically suited for use on
a configurable computer [?].
The reason that low-precision arithmetic can be used in
iterative improvement problems
is that the answer converges gradually. During each step, a
correction is found that improves
the solution. When the correction is large and the highest bits
of the solution are converging,
the low bits do not hold any useful information. Therefore,
there is no advantage to using
a precision that calculates the low order bits before the upper
bits have converged. As
the solution becomes closer to the final answer, the refinement
at each step becomes less.
Because the refinement is small, the high order bits no longer
change. At this point there is
no longer any reason to recompute the high bits of the
solution.
This property of iterative improvement problems makes it
possible to use fewer bits to
calculate the correction than the number of bits that are in the
final result. In this way,
only the high order bits are calculated while the correction is
large; inversely, only the low
order bits are computed when the correction becomes small. This
is possible by calculating
the error (residual) in the equation for the iterative
improvement problem. The goal of the
example below is to find a value of x that satisfies the equation

A ∗ x = b. (2.1)

The residual (or error) for the current approximation xi can be written as

r = b − A ∗ xi. (2.2)

Instead of using the initial equation, the change in x can be calculated as

∆xi = A^−1 ∗ r. (2.3)
The previous calculation can be performed with lower precision arithmetic. This step is iterated a number of times, and ∆xi is then added back into the previous solution:

xi+1 = xi + ∆xi. (2.4)
This method has been shown to converge to the correct solution
[?]. It is applicable to our
work in CA because the CA models we use for structural design
optimization are in a form
that utilizes this method. The advantage of using this method on
reconfigurable computers
comes from the fact that the bulk of the operations are
performed during the update phase.
A large number of resources can be devoted to accelerating the
update calculations because
the update can be calculated at a low precision.
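The scheme of Equations 2.1-2.4 can be sketched in software. The example below is a scalar stand-in for the matrix problem, not the thesis's hardware implementation: chop emulates low-precision arithmetic by keeping only a few significant bits, and the corrections it produces are accumulated into a full-precision solution.

```python
import math

def chop(v, bits):
    """Keep only `bits` significant bits of v, mimicking a low-precision unit."""
    if v == 0:
        return 0.0
    e = math.floor(math.log2(abs(v)))
    scale = 2.0 ** (bits - 1 - e)
    return round(v * scale) / scale

def refine(a, b, bits=6, steps=60):
    """Solve a*x = b by iterative improvement (scalar stand-in for A*x = b)."""
    x = 0.0
    for _ in range(steps):
        r = b - a * x           # residual (Eq. 2.2), kept in high precision
        dx = chop(r / a, bits)  # correction (Eq. 2.3), found in low precision
        x += dx                 # accumulate into the solution (Eq. 2.4)
    return x
```

Even though each correction carries only a handful of significant bits, the residual shrinks by a constant factor per step, so the accumulated x converges far beyond that precision.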
Chapter 3
System Design
This section describes two approaches to implementing CA models
on FPGAs. Both ap-
proaches use an array of uniform, simple processing elements
(PEs) spread throughout the
chip. A large number can fit on a single FPGA because the PEs
are relatively simple. This
distributed computing is effective because of the parallel
nature of solving the CA models.
The two designs described in this section illustrate a
fundamental tradeoff in hardware
design: flexibility versus speed. The first implementation is a
custom circuit developed
to solve the analysis equation for a given design. The second
implementation executes a
program stored in memory that controls arithmetic operations.
Both designs solve the same
analysis problem.
It is important to note that the underlying theory behind the
two designs is the same. In
both cases, the design is intended to determine the displacement
and rotation of sections of
a beam given the constraints and external forces on the beam.
Though they solve the same
problem, the motivation behind each design is fundamentally
different. Therefore, although
the same equations are used for solving for the beam variables,
the form of the equations are
optimized for the specific implementation.
3.1 Design Background
When performing operations on an FPGA, it is much more efficient
to use fixed-point arith-
metic than floating-point arithmetic. For this reason, both of
the designs represent numbers
in fixed-point notation. The nature of CA models allows for this
type of representation. The
position of the binary point depends on the architecture and the type
of data being stored. The
number of bits of precision varies based on the operation being
performed. In both models,
intermediate values produced during calculations are stored in
increasing precision to avoid
loss of data. The data is then truncated before the final value
is stored.
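A fixed-point value of this kind is simply an integer with an implied scale factor. The sketch below is illustrative only (the bit widths are not those of the actual designs); it shows encoding, decoding, and a multiply whose double-width intermediate result is truncated before storage, as described above.

```python
FRAC_BITS = 10  # position of the binary point; an assumed value for illustration

def to_fixed(x, frac_bits=FRAC_BITS):
    """Encode a real number as an integer with an implied scale of 2**-frac_bits."""
    return int(round(x * (1 << frac_bits)))

def from_fixed(n, frac_bits=FRAC_BITS):
    """Decode a fixed-point integer back to a real number."""
    return n / (1 << frac_bits)

def fixed_mul(a, b, frac_bits=FRAC_BITS):
    """Multiply two fixed-point values; the intermediate product carries
    2*frac_bits fractional bits and is truncated back before storing."""
    return (a * b) >> frac_bits
```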
These designs were developed to perform calculations for a one-dimensional CA model,
with two degrees of freedom for each cell. The arithmetic for
higher dimensional problems can
be performed without significant changes to the structure of the
PE. The main difference
in higher dimensional problems is the change in the
communication pattern. In the one-
dimensional models considered, each cell only needs to
communicate with its immediate
right and left neighbors. In the case of two-dimensional problems, cells often need to
communicate data with four to eight neighboring cells.
3.1.1 System Hardware
The concepts presented in this thesis for solving CA models on
configurable computers can
be applied to many hardware configurations; however, both
designs were developed with a
particular system in mind. The system uses a host PC connected
over a PCI bus to a card
containing five FPGAs (see Figure 3.1). The FPGAs are all Xilinx
Virtex II - XC2V4000
chips [?]. There are several features which make the Virtex II
desirable for simulating CA
models.
The first advantage of the Virtex II is the large amount of
internal RAM distributed
throughout the chip. These internal BlockRAMs have customizable
width and depth. They
also have two ports that can independently read and write to
different addresses.

Figure 3.1: Setup of the configurable computer used for simulating the CA model.

Transferring data on and off chip is an expensive operation and is
typically the bottleneck in most
applications. By utilizing these memories, we avoid having to
transfer data to external banks
of RAM.
The second advantage of the Virtex II is the built-in
multiplication units. In the sea-of-
gates model for FPGAs, implementing multipliers is expensive.
This is especially true if the
precision is large because the size of multipliers grow with the
square of the number of bits
of precision. In the Virtex II, there is a built-in multiplier
associated with each BlockRAM.
This lends itself to the distributed processor models we
used.
3.1.2 Distributed Layout
FPGAs are designed to be as flexible as possible so they can be
used in many applications,
but this flexibility comes at a cost in terms of space and speed
for any arithmetic unit when
compared to custom VLSI. Chips such as general purpose
processors have custom designed
arithmetic units that have a significant advantage in executing
sequential operations. The
reason an FPGA has the potential for speed up versus a general
purpose processor is that it
can perform many operations in parallel or deeply pipeline the
operations.
In order to maximize the ability of an FPGA to perform
operations in parallel, as much
of the reconfigurable resources as possible should be in use at
the same time. To accomplish
this objective, both designs use many uniform, simple PEs
operating in parallel. Each PE
is responsible for calculating the next value for a section of
cells in the CA model (see
Figure 3.2).
Figure 3.2: Distribution of logical CA cells among PEs.
This distribution is simplified because each cell in the CA
model is governed by the
same equations. The arithmetic units in each PE implement the
governing equation; each
logical cell is represented by the data values that are inserted
into the equation. There is a
BlockRAM associated with each processing unit that stores the
set of data values for each
cell. The number of cells represented by a PE is determined by
the number of logical cell
data sets that can be stored in the BlockRAM.
This concept of having multiple cells per PE greatly increases
the number of logical cells
that can be represented in a design. A certain amount of chip resources is needed to calculate
a cell update. If only one cell were represented in each PE, the PE could be slightly
smaller and a BlockRAM would not be needed. However, the resources required do not increase
greatly in moving from a PE that calculates the update for one cell to one that calculates
updates for many cells. There are enough BlockRAMs on the Virtex-II that the number of
BlockRAMs does not limit the number of PEs that can fit on the chip.
During a single iteration, all of the logical cells contained
within a processing unit are
updated once. The update for a cell depends on its right and
left neighbors. To calculate
the update for cells on the edge of the section of logical cells
a PE represents, the PE needs
data from the PEs to its right and left. At the end of an
iteration, each PE transfers the
data from its leftmost cell to the PE representing cells to the
left. Likewise, the data from
the rightmost cell must be transferred to the PE representing
the cells to the right.
After this transfer, each processing unit has all of the
information needed to compute the
next update for all of the cells it represents. Calculations for
all cells can start simultaneously
because the necessary information about all cells is known at
the beginning of the iteration.
In both designs, registers are placed between arithmetic units.
If an arithmetic unit required
more than one cycle to complete, a pipelined version of the
component was used. This
pipelining allows multiple cell updates to be computed
concurrently, because cells do not
need to wait until the previous cell has completely finished
processing.
3.2 Problem Specific Design
The original direction of this project was to develop a toolset
that could rapidly produce
custom hardware models based on specific problems. A designer
who wanted to use the tools
would specify the problem in a custom programming language. A
compiler would interpret
the input code and produce a custom FPGA configuration to solve
the problem. It was
expected that a toolset could be developed for creating custom
bitstreams rapidly enough
to make the system useful.
The first step in this development process was to analyze
typical CA analysis equations
and manually create an optimized layout. The equations used are
based on an analysis
problem with two degrees of freedom, v and Θ.
v̄c = C0(vl + vr) + C1(Θl − Θr) + Fc
Θ̄c = C2(vl − vr) + C3(Θl + Θr) + Mc        (3.1)
The variable vc represents the v value for the cell currently being processed, while vl and
vr are the v values of the current cell's left and right neighbors; the Θ variables follow the
same convention. Fc represents an external force and Mc an external moment. v̄c is the value
of vc at the next time step.
These equations can be used to
solve a one-dimensional CA analysis problem, such as deflection
of a uniform beam. This
form of the equations was chosen because it can be mapped to a
small, linear circuit. The
main goal was to minimize the number of multiplications needed
because multiplication units
are costly in terms of space on the FPGA.
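For reference, Equations 3.1 can be expressed directly in software. The Python sketch below is illustrative only; the coefficient values and neighbor data are hypothetical placeholders, and only the structure of the equations comes from the text.

```python
# One update of Equations 3.1 for a single cell.  The coefficient
# values used below are illustrative placeholders, not values from
# the thesis; only the structure of the equations is taken from it.

def update_cell(C, v_l, v_r, th_l, th_r, F_c, M_c):
    """C = (C0, C1, C2, C3).  Returns the next-step (v, Theta) pair.
    The two results are independent, which is why the hardware can
    compute them simultaneously."""
    v_next = C[0] * (v_l + v_r) + C[1] * (th_l - th_r) + F_c
    th_next = C[2] * (v_l - v_r) + C[3] * (th_l + th_r) + M_c
    return v_next, th_next

# Hypothetical coefficients and neighbor values.
C = (0.5, 0.1, 0.1, 0.5)
v_next, th_next = update_cell(C, 0.0, 0.0, 0.0, 0.0, F_c=0.2, M_c=0.0)
```

Note that each update needs four multiplications; minimizing and sharing these is exactly the layout concern discussed above.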
An optimized design was built to solve these equations. For each
operation a variety of
components was considered, and multiple layouts were
investigated in the implementation
of the equations. Maximum clock frequency, latency, and size
were examined when selecting
each component. To further optimize the circuit, vc and Θc, which are independent, are
computed simultaneously. Figure 3.3 shows the final optimized design.
The outputs of all the components shown in Figure 3.3 are
registered. Additionally, the
constant multipliers are pipelined. The resulting latency through
the circuit is 6 clock cycles.
The circuit is designed such that all information to compute the
update value is provided
at the point at which it is needed. In particular, the Fc and Mc
values are loaded 5 clock
cycles after the corresponding Θ and v values. In this design,
when this pipeline is filled the
circuit can produce an updated value every clock cycle.
The constant multipliers were used because they had a much lower
latency and were
much smaller than traditional multipliers. Using constant
multipliers is only possible if
the coefficients in Equations 3.1 are fixed. In the case of
analyzing the deflection of a
uniform beam, these coefficients are constant. These multipliers
have the characteristic of
Figure 3.3: Arithmetic unit for Problem Specific design.
having a structure independent of the constant multiplicand.
Therefore, if the location in
the bitstream of the constant multiplier is known, the values in the FPGA look-up tables
(LUTs) could be modified directly to reflect changes in the coefficient.
The disadvantage of using constant multipliers is that design
optimizations made to the
density of the beam would require that a different type of
multiplier be used. Also, if the
beam were not uniform, the values of vc and Θc would be needed to compute the updated
values, v̄c and Θ̄c. This narrows the usefulness of this design, but it
provides an optimized baseline
for comparing other designs.
Each PE in the Problem Specific design contains arithmetic
logic, a finite state machine
(FSM), and a BlockRAM. The BlockRAM contains all of the values
for the cells. The FSM
controls the addresses from which data is loaded and stored in
the BlockRAM. The PE
operates most efficiently when the pipeline is filled. When the
pipeline is filled, a new set of
data needs to be applied each clock cycle, and updated values
need to be stored each clock
cycle. To accommodate this flow of data, one port of the
BlockRAM is devoted to loading
data and the other is devoted to storing data (see Figure
3.4).
Figure 3.4: PE layout for Problem Specific design.
The Edge Registers are used to communicate data to neighboring
PEs. When the update
for a cell on either end of the section of the model for which
the PE is responsible is calculated,
the new value is stored in the Edge Registers. Each PE has
access to these registers in its
right and left neighbor. When data is needed from a neighbor,
the values are loaded from
the Edge Registers instead of from the BlockRAM. For PEs on the
boundary of the model,
the Edge Registers are connected to constant values.
To implement design optimization, FPGA configuration bitstreams
need to be produced
for both analysis and design phases. The FPGA would first be loaded with the analysis
configuration, which would be iterated until the data values converged. After the
cells converge, the data needed for the design improvement phase
is stored in the internal
BlockRAMs. The FPGA would next perform a partial reconfiguration
and load the design
improvement bitstream, during which the contents of the
BlockRAMs would not be changed.
In this way, data would be passed between the analysis and
design phases.
The results would be extracted from the board through readback.
During the readback
operation, the FPGA dumps its entire configuration including
flip-flops and BlockRAM
contents. Once the contents of the FPGA are dumped, careful
filtering of the data would
yield the current results. This method eliminates the need for
using specialized hardware to
support downloading data.
The residual-update method, described in the Background chapter,
can be used in finding
the solution for a CA model because it is an iterative
improvement problem. The advantage
to using this method would be that low precision calculations
can be used to generate a high
precision result. The reconfiguration between analysis and
design phases would provide the
opportunity needed for loading updated coefficients to the FPGAs. The result of implementing
the residual-update method would be that an 8-bit design could produce results with
precisions such as 16 or 32 bits.
3.3 Program Based Design
The Program Based design represents a fundamentally different
approach to solving the same
analysis problem as the Problem Specific design. The Problem
Specific design can perform
analysis updates very rapidly because it uses custom hardware.
However, using a custom
design means that for each new problem an optimized circuit must
be designed, and an FPGA
configuration must be generated. The overhead of building a
custom configuration for each
problem could easily erase any speed advantage. On the opposite
end of the spectrum, a
compiled program running on a general purpose CPU has very low
initial overhead, but it
cannot take advantage of the inherent parallelism of CA. The
Program Based design was
developed to bridge the gap between the analysis speed of custom
hardware and the flexibility
of a general purpose processor.
The first major change, compared to the Problem Specific model,
is that the Program
Based design executes a program stored in internal BlockRAM to
control data accesses
and the arithmetic units in the PEs. In the Problem Specific
model these operations were
performed using a fixed finite state machine. Another
significant change is that the control
logic is removed from the PEs and placed in a central control
unit. The signals are then
propagated to the PEs throughout the chip. The third important
modification is that the
equations are represented in a matrix form to provide a more
flexible architecture that
can handle a variety of problems. This matrix arithmetic is
expressed in the layout of
the arithmetic units. The last major difference is that the
Program Based model has the
capability to compute results in both high precision and low
precision forms on the FPGA
and then combine the two results.
The goal of flexibility for the Program Based design is
reflected in the form of the equations
for the model. The hardware is designed to solve problems set up
in matrix form. This
provides a simpler method to implement, and eventually automate,
CA design algorithms.
The matrix form of the beam equations is shown in the following
equations:
[v̄c]   [C0 C1] [vl]   [K0 K1] [vc]   [C4 C5] [vr]   [Fc]
[Θ̄c] = [C2 C3] [Θl] + [K2 K3] [Θc] + [C6 C7] [Θr] + [Mc]        (3.2)
This equation solves the same analysis problem as the Problem
Specific design. This is
one of a range of two-dimensional problems that can be solved by
the Program Based design.
Equations can be implemented with any number of terms and are
expressed in matrix form.
The Problem Specific model only solves problems that can be
represented in the form of the
beam equations, while the Program Based model has the capability
to capture the behavior
of a variety of problems.
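The matrix form generalizes naturally in software as well. The Python sketch below (with invented coefficient matrices, not values from the thesis) evaluates one update in the three-term form of Equation 3.2 using plain 2x2-by-2x1 multiply-accumulates, mirroring what the Matrix Accumulator hardware computes.

```python
# One update in the three-term matrix form of Equation 3.2, using
# plain 2x2-by-2x1 multiply-accumulates as the Matrix Accumulator
# hardware does.  All coefficient values here are illustrative.

def matvec2(M, x):
    """Multiply a 2x2 matrix by a 2x1 vector."""
    return [M[0][0] * x[0] + M[0][1] * x[1],
            M[1][0] * x[0] + M[1][1] * x[1]]

def update_matrix_form(Cl, K, Cr, left, center, right, load):
    """Sum the left, center, and right matrix terms plus the
    load vector (Fc, Mc)."""
    terms = [matvec2(Cl, left), matvec2(K, center), matvec2(Cr, right)]
    return [sum(t[i] for t in terms) + load[i] for i in range(2)]

Cl = [[0.25, 0.0], [0.0, 0.25]]   # coefficients for the left neighbor
K  = [[0.50, 0.0], [0.0, 0.50]]   # coefficients for the cell itself
Cr = [[0.25, 0.0], [0.0, 0.25]]   # coefficients for the right neighbor
out = update_matrix_form(Cl, K, Cr, [1.0, 0.0], [2.0, 0.0],
                         [3.0, 0.0], [0.1, 0.0])
```

Different cell configurations (boundary cells, loaded cells) are then just different coefficient matrices, which is what lets a single hardware layout cover many problems.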
The complexity of control logic increased greatly in the Program
Based model as compared
to the Problem Specific model. The finite state machines that
controlled the load and store
logic of the Problem Specific model are ill-equipped to handle
the increase in complexity.
The control logic in the Program Based model uses significantly
more resources, so it is
advantageous to move the control logic to a centralized
location. There is a penalty involved
in distributing the control signals; however, the size of each PE would more than double if
the control logic were not centralized.
The architecture of having a single control unit makes the
Program Based design similar to
a Single Instruction, Multiple Data (SIMD) parallel computer
(see Figure 3.5). Removing the
control from the individual PEs is possible because all cells,
including boundary cells, can be
represented by changing the coefficients in the matrix equation.
Historically, there has been
a lack of widespread interest in SIMD parallel computers because
they are inflexible and
require custom processors. However, SIMD machines have been
successful in multimedia
and DSP applications [?]. These applications involve repetitive
calculations that can be
performed in parallel, similar to those needed for CA
models.
Figure 3.5: Return data chains for Program Based design.
The Control Unit (CU) requires feedback from the PEs, for
example a flag indicating
that calculations are complete. The routing resources around the
CU would be consumed
quickly because there are a large number of PEs that need to
communicate with the CU. To
avoid this problem, there are multiple PEs on each return data
bus so only the last PE in
the chain needs to be routed directly to the CU. The drawback is
that extra computational
cycles are needed. This is because the returning data takes an
extra clock cycle to propagate
back to the CU for every link.
Instructions stored in the CU are not like those of a
traditional microprocessor. The
instructions for a traditional general purpose processor are
encoded, while the instructions
in this design are stored as a 72-bit word that requires no
decoding. The result is that
most control signals can be connected directly from the memory
in the CU to the PEs (see
Figure 3.6). This method of storing instructions is fast and allows any combination of
control signals to be asserted simultaneously, achieving maximum parallelism.
Figure 3.6: Control Unit for Program Based design.
The instructions contain two main parts: flow control information and control signals. The
flow control portion interacts with the flow control logic in
the CU to determine which
instruction is executed next. The flow control logic allows for
increments to the program
counter, branches, and conditional branches. The control signals
manage operations in the
PEs. These include: clearing registers, loading data, and
shifting data. The signals for
controlling the BlockRAMs in the PEs are fed through address
logic to allow absolute and
relative address jumps.
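The idea of a wide, undecoded control word can be illustrated with simple bit packing. In the sketch below the field layout, names, and widths are all invented for illustration; the thesis specifies only that the word is 72 bits and maps directly onto control wires.

```python
# Sketch of a wide, undecoded control word.  The field names and
# widths below are invented for illustration; the thesis specifies
# only that the word is 72 bits and needs no decoding.

FIELDS = [
    ("next_pc_mode", 2),    # increment / branch / conditional branch
    ("branch_target", 10),
    ("clear_regs", 1),
    ("load_data", 1),
    ("shift_data", 1),
    ("blockram_addr", 11),
    ("misc_signals", 46),
]
assert sum(width for _, width in FIELDS) == 72

def pack(values):
    """Pack a dict of field values into one integer control word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert 0 <= v < (1 << width), name
        word |= v << shift
        shift += width
    return word

def unpack(word):
    """Slice the word back into fields -- in hardware this is just
    wiring, which is why no decoding logic is needed."""
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields

w = pack({"next_pc_mode": 1, "branch_target": 5, "load_data": 1})
```

Because `unpack` is pure bit slicing, the "decoder" in hardware reduces to routing each field's wires to the right destination, which is the speed advantage described above.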
Though there is a plan to automate the process of writing the
programs loaded into the
control unit, the first programs were written manually. A
spreadsheet was used to select the
values of every control signal at each time step. The
spreadsheet was set up to automatically
insert the signal values into the proper bit positions (see Appendix A for an example). The integer
equivalent of the binary number is loaded into the control unit
memory at compilation time.
In the current design, the program cannot be changed at run
time.
To understand the reasoning behind the arithmetic logic in the
PEs, it is necessary to
understand the process for using a residual and an update to
calculate results. It is possible
to use a residual-update method, described in Chapter 2, to find
the solution because the CA
solutions are attained by iterative improvement. This method has
the advantage of using
low precision arithmetic for most calculations. In describing
this method, n is the number
of bits used in high precision calculations and k is the number
of bits used in low precision
calculations.
This method works by first calculating the residual, or error in
the equation that is being
solved. The residual calculation must be performed in n bits for
every cell in the model. The
most significant k bits of the residuals are then extracted and
stored. The k bits must be
taken from the same position in every residual. The largest
element in the residual vector
dictates which bits are selected. The update equation is then
calculated in k bits, and the
k-bit version of the residual is used in place of the Fc and Mc
in Equation 3.2. This k-bit
update is performed until the results converge. After the k-bit updates are found, they are
added into an accumulated version of the variables at the same
offset as the bits that were
taken out of the residual. The cycle repeats using the
accumulated version of the variables
in the residual equation. These iterations are repeated until
the accumulated versions of the
variables converge.
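The residual-update cycle described above can be modeled in a few lines of software. The sketch below is a hedged approximation: the 2x2 linear system, the quantization model of k-bit arithmetic, and the Jacobi inner solver are all stand-ins chosen for demonstration; the hardware applies the same idea to the CA equations instead.

```python
# Software model of the residual-update (iterative refinement)
# cycle: residuals are computed in high precision, the correction
# is solved with quantized low-precision arithmetic, and the
# corrections are accumulated into a high-precision solution.

def quantize(v, k):
    """Round v to k fractional bits, modeling k-bit arithmetic."""
    return round(v * (1 << k)) / (1 << k)

def refine(A, b, k=8, outer=20, inner=30):
    n = len(b)
    x = [0.0] * n                      # high-precision accumulator
    for _ in range(outer):
        # Residual r = b - A x, computed in high precision.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n))
             for i in range(n)]
        # Normalize by the largest residual, mimicking the selection
        # of the top k bits, then quantize.
        scale = max(abs(v) for v in r) or 1.0
        rs = [quantize(v / scale, k) for v in r]
        # Low-precision inner iterations (Jacobi) solve A d = rs.
        d = [0.0] * n
        for _ in range(inner):
            d = [quantize((rs[i] - sum(A[i][j] * d[j]
                                       for j in range(n) if j != i))
                          / A[i][i], k)
                 for i in range(n)]
        # Add the correction back at the original scale.
        x = [x[i] + scale * d[i] for i in range(n)]
    return x

A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]
x = refine(A, b)                       # exact solution is (0.8, 1.4)
```

Even though every inner step rounds to 8 fractional bits, the accumulated answer converges well past 8-bit accuracy, which is the point of the method.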
The flow chart in Figure 3.7 shows an analysis cycle using this
method. This method
is effective because the majority of the time is spent in
calculating the update in k bits.
More parallel arithmetic units can be used to speed up the calculations because the update
calculation is performed in k bits.

Figure 3.7: Analysis cycle flow and precision for each operation.
There are three main parts to the PEs used in the Program-Based
design (as shown in
Figure 3.8):
- Multiply Accumulator: calculates the residual in n bits
- Shift Unit: extracts k bits from n-bit residual and adds the
update into accumulated
variables
- Matrix Accumulator: calculates cell updates in k bits
The Multiply Accumulator is simply a multiplication unit and an
adder with the registered
version of its output connected to one of its inputs (see Figure
3.9). There is only one
multiply accumulator per PE because it uses n-bit arithmetic,
and these n-bit precision
units are large. The Multiply Accumulator takes advantage of the
built-in 18x18 multiplier
units on the Virtex-II FPGAs to save resources.
The minimization of the hardware results in multiple clock cycles being needed to compute
residual values.

Figure 3.8: Computational unit for Program Based design.

Figure 3.9: Multiply accumulator used in computational unit.

The latency through each unit is one clock cycle, so the pipeline is two stages. For the
equation proposed at the beginning of this section, it takes 16 clock cycles to calculate
the residuals for one cell. The expense of the residual calculation is tolerable because
many update calculations are performed between residual calculations.
After the residual is calculated it must be converted to a k-bit
number. During the
residual calculation, the most significant bit of the largest
residual value is found. There is
a mechanism in each PE that stores the absolute value of the
largest residual calculated.
This value is passed along the return data chain until it
arrives at the control unit. Each
PE performs a logical OR on the value passed to it and the
largest value it has calculated.
This process destroys the actual value of the largest residual,
but the number passed to the
control unit shows the position of the most significant ‘1’.
This position is used to determine
which bits of the residual are stored for the update phase.
Figure 3.10: MSB data return chain, used for determining the
most significant ‘1’ of residuals.
The logic to extract the bits is based on a multiplexer, a
register, and a right shifter. The
multiplexer selects between an input from memory or a right
shift version of the value stored
in the register. The value output from the multiplexer is loaded
into the register. During the
bit shifting phase, the register is first loaded with the n-bit
value. The right shifted value
is then selected by the multiplexer. The value is looped through the right shifter until the
desired bits are in the lowest positions. The number of clock
cycles required is dependent on
the number of bit positions the value needs to be shifted.
Figure 3.11: Unit for shifting the precision of intermediate
results.
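The OR-chain and shift behavior can be mimicked in software. The sketch below is illustrative: ORing the residual magnitudes destroys their individual values but preserves the position of the most significant '1', which then fixes how far each residual is shifted so its top k bits land in the lowest positions. The helper names are inventions for demonstration.

```python
# Software mimic of the OR-chain and shift unit.  ORing the residual
# magnitudes destroys their values but preserves the position of the
# most significant '1', which fixes the shift amount for every PE.

def msb_position(values):
    """OR all magnitudes together, as the return data chain does,
    and report the highest set bit (-1 if everything is zero)."""
    combined = 0
    for v in values:
        combined |= abs(v)
    return combined.bit_length() - 1

def extract_k_bits(value, msb, k):
    """Right-shift so the k bits at and below the shared MSB
    position land in the lowest bit positions."""
    shift = max(msb - k + 1, 0)
    return abs(value) >> shift

residuals = [0b101100, 0b000111, 0b010001]
msb = msb_position(residuals)               # highest '1' is bit 5
top = [extract_k_bits(v, msb, 4) for v in residuals]
```

Because every residual is shifted by the same amount, the extracted k-bit values stay comparable to one another, which the update phase requires.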
The adder, after the shift logic, is used during the addition
phase at the end of the outer
analysis cycle. During the addition phase, the update is loaded
into the highest bits and
shifted to the correct position. It is then added to the
previous value, which is read from
memory. A signal from the control unit selects which value is
output from the unit.
The final piece of the PE for the Program Based design is the
Matrix Accumulator. The
Matrix Accumulator is similar to the Multiply Accumulator unit,
except the arithmetic is
performed in k bits and more hardware is used to speed up
calculations. The unit is designed
specifically to be able to multiply a 2x2 matrix by a 2x1
matrix. For example, Figure 3.12
shows the circuit calculating the equation:
[v̄]   [C0 C1] [v]
[Θ̄] = [C2 C3] [Θ]        (3.3)
Figure 3.12: Matrix accumulator used for analysis updates.
The multiplier has a three clock cycle latency and is fully
pipelined. The entire unit has
latency of five clock cycles. The update for each cell using the
matrix version of the beam
equations, described earlier in this section, takes 9 clock
cycles. However, when the pipeline
is filled, the circuit can produce an update every five clock
cycles, and this circuit calculates
the update for both analysis variables simultaneously.
Every PE can select to have the input to Port B of the memory
connected directly to the
output of Port A of its left or right neighbor. In this way, PEs
transfer data about the cells
on the edge of the section of the CA model for which it is
responsible. This system is also
used to upload and download data from the FPGA. The PE that
calculates the values for
the cells on the left end of the model can read data from the
PCI bus, while the PE that
calculates the values for cells on the right end of the model
can write data to the PCI bus.
Figure 3.13: Data flow for uploading and downloading data to the FPGAs.
To upload coefficients and external forces, as well as to initialize variable values, the host
computer begins by writing the data destined for the rightmost PE into the memory of the
leftmost PE. The data is then shifted through all the PEs until it reaches the proper place;
at the same time, new data is shifted into the leftmost PE. Downloading the results is a
similar process: it involves shifting the data right and reading it off the rightmost PE. An
external clock is used to keep data transfers synchronized.
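The upload path can be pictured as one long shift register running through the PE memories. The Python sketch below is an illustration of that behavior (the structure is invented; the hardware shifts through BlockRAM ports rather than a Python list).

```python
# Sketch of the upload path: the PE memories behave like one long
# shift register.  Values destined for the rightmost PE are written
# first and shifted the farthest, while new data enters on the left
# at each transfer clock.

def upload(values, num_pes):
    pes = [None] * num_pes
    for v in values:
        for i in range(num_pes - 1, 0, -1):   # shift everything right
            pes[i] = pes[i - 1]
        pes[0] = v                            # new data enters left PE
    return pes

loaded = upload(["d3", "d2", "d1", "d0"], 4)
# loaded[3] is "d3": the first value written ends at the rightmost PE
```

Downloading is the mirror image: data is shifted right and read off the rightmost PE one value per transfer clock.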
Although reconfiguration is not part of the analysis cycle, the
implementation of the
system for performing design optimization will use
reconfiguration in a number of ways.
Each analysis model has fixed connections for communicating
among PEs. It is possible
to pass data through intermediate PEs to transfer data between
PEs that are not directly
connected. However, to achieve maximum efficiency, communication
should be done over
direct connections when possible. The design system will have a
number of different analysis
models, each with a different communication pattern. When the
user specifies the initial problem, the system will select the bitstream for the most
appropriate model and load it into
the FPGAs.
Design optimization may be performed in a number of different
ways. The first possible
technique is to use reconfiguration. A bitstream developed to
perform design optimization
could be loaded on the FPGAs using partial reconfiguration. The
data would be passed
between analysis and design models through the BlockRAMs, like
the method proposed
for the Problem Specific model. Another technique would be to
use the analysis model
to perform the design calculations. Design would require new
coefficients, which could be
loaded into the FPGA using the uploading and downloading methods described earlier. The
disadvantage of this method is that the analysis design might
not be capable of performing
all of the operations needed, or the operations may be very
inefficient. The final possibility
is to use a Virtex-II Pro FPGA, which contains built-in PowerPC
processors. These internal
processors could be used to run a program to calculate the new
design values.
Chapter 4
Results
4.1 Problem Formulation
The results in this section are based on solving the analysis of
a CA model of a one-
dimensional beam. The model is formulated from work by
researchers at Virginia Tech
[?]. The beam is divided into cells that have two degrees of
freedom, vertical displacement
(w) and rotation (q). Each cell also has a separate vertical
thickness, which is the design
variable. The thickness of the beam is specified at the middle
of each cell, and then linearly
interpolated in between the specified points (see Figure 4.1).
Cells in the model are evenly
distributed along the beam.
There are a number of possible configurations for each cell. The
cell can have a fixed
displacement, a fixed rotation, a fixed displacement and
rotation, or it can be free in dis-
placement and rotation. External forces can be applied to any
cell. The forces can be in
the form of a vertical force (F) or a bending moment (M). These
different configurations are
represented by changing the coefficients in the equation that is
solved by each model. Using
these available cell configurations, many classical static beam
problems can be solved.
The CA model, shown in Figure 4.2, was modeled with 20 cells and
was run on both the
Figure 4.1: Diagram of the CA model for performing analysis on a beam.
Figure 4.2: Beam analysis problem modeled on the configurable
computer.
Problem Specific and Program Based designs. Twenty cells were chosen so the model could be
quickly simulated. The first cell in the model is a dummy cell,
for which no computations
are performed. The cells (1 and 19) on the ends of the beam have
fixed displacement and
rotation. Cell 14 has a fixed vertical displacement. All other
cells in the model are free
in displacement and rotation. There is a vertical force pushing
up on cell 9. The force is
scaled to produce a maximum displacement of slightly less than
127, so the result can be
represented in 8 bits.
4.2 Problem Specific Design Results
The designs presented in the implementation section were
intentionally developed independently of any fixed precision for the results. There is a
trade-off between the number of bits of the solution that will be calculated and the number
of cells
that can be represented in the
system. In addition, larger precision results in lower maximum
clock frequency and/or an
increased pipeline length.
Another factor in changing the precision is memory access. The
BlockRAMs have pro-
grammable port widths that can accommodate some changes in
precision. The BlockRAMs’
two ports can each handle up to 36 bits and can independently
read or write. Once the data
transfer limit has been exceeded, accessing the data needed will
take multiple clock cycles
or more memories must be used in the design.
During development, multiple versions of the Problem Specific
design were built that used
different precision for calculations. Figure 4.3 shows the
growth of the size of a PE as the
precision of the calculations is increased. The graph shows that
size grows rapidly as the
number of bits is increased. This makes it very important to use
the least precision possible
for calculations.
From this data, and based on the beam problem being solved, 8
bits of precision was
chosen as likely to be the most effective. In most of the
following analysis, an 8-bit model
was studied for the Problem Specific design. However, this
precision could change based on
the type of problem being solved and the number of cells in the
model. In this respect,
the Problem Specific design would have more flexibility with
regard to precision than the
Program Based design because the Problem Specific design is
custom-made for each problem.
Once 8 bits was selected for the precision, the number of PEs
needed to be selected. The
maximum number of PEs that can fit on an FPGA is limited by the
programmable logic
and routing resources on the chip. However, when the chip usage
gets high, the routing gets
inefficient and the maximum frequency at which the circuit can
be clocked drops rapidly.
Figure 4.3: Precision of PE vs. Percent Utilization of FPGA for Problem Specific design.
Figure 4.4: % Utilization of FPGA and maximum clock frequency vs. number of PEs for
Problem Specific design.
Figure 4.4 shows the chip utilization and the maximum clock
frequency versus the number of
PEs. The number of logical cells that can be represented
increases linearly with the number
of PEs. However, the maximum clock frequency decreases gradually
as the number of PEs
increases, then drops quickly after the 35th PE.
The Problem Specific and Program Based designs vary widely in
the number of cells they
can represent and the precision of the result. In order to
compare these differing designs,
the maximum number of cell updates per second is used as a
metric. This is also used as the
metric to determine the speed-up over a program running on a
general purpose processor.
The number of cell updates per second for the Problem Specific
design is simply the
number of PEs multiplied by the maximum clock frequency. This is because in the Problem
Specific implementation, each PE produces a result every clock cycle during analysis.

Figure 4.5: Cell updates per second vs. number of PEs for the Problem Specific design.

Figure 4.5 shows a peak in the maximum number of cell updates when 35 PEs are on the chip.
With 35 PEs the 8-bit design has a maximum clock frequency of
64.5 MHz. In comparison,
a 12-bit model with 35 PEs cannot fit on the Virtex-II 4000
FPGA.
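The peak figure can be checked directly from the metric's definition: with the pipeline full, every PE completes one update per clock, so the peak rate is simply the PE count times the clock frequency. A quick computation with the reported 35 PEs at 64.5 MHz:

```python
# Peak cell-update rate for the Problem Specific design: with the
# pipeline full, every PE completes one update per clock, so the
# peak rate is the PE count times the clock frequency.
num_pes = 35
clock_hz = 64.5e6
updates_per_second = num_pes * clock_hz   # 2.2575e9, i.e. ~2257 million
```

This is consistent with the peak shown in Figure 4.5.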
There are additional factors that limit the cell updates per
second that can actually be
performed on the system (see Table 4.1). The most costly factors
are the reconfiguration
and readback times on the FPGA and the time it takes the host to
compute the coefficients
for a given design. These times are important because the
communication with the host is
done through reconfiguration and readback.

Operation            1 PE    1 FPGA    DINI Board
Reconfig             N/A     1190      4760
Readback             (not yet implemented in the DINI API)
Host computations    1.11    39        156

Table 4.1: Times (in ms) for operations associated with the Problem Specific analysis cycle
on the DINI board.
These results are dependent on the design being able to
accurately produce analysis
results. The problem described earlier in this chapter in the
Problem Formulation section
(see Figure 4.2) was modeled on the Problem Specific design. The
force was scaled so the
result could be represented in 8 bits. The actual
results were calculated using a
C++ program running on a PC which used floating-point arithmetic
for all calculations. The
results show (see Figure 4.6) that the system was able to
produce results that were similar,
but not exactly correct. The mean of the error between the
actual results and the results
attained from the Problem Specific model for the displacement
and rotation were 38.4% and
41.4% respectively. This large error is due to the rounding that takes place because
fixed-point arithmetic is used. If better accuracy were needed, the answer could be improved
by using reconfiguration and the residual-update method for iterative improvement.
4.3 Program Based Design Results
The Program Based design has much the same sensitivity to precision as the Problem Specific
design, but the Program Based design does not have quite as much flexibility in terms of
precision. Because the full precision calculations of the
residual rely on the built-in 18x18
multipliers, it is difficult to increase the full precision
result to more than 18 bits. However,
there is some flexibility in the precision of the update calculations.

Figure 4.6: Actual results and results from Problem Specific design for beam problem.

Figure 4.7: Precision of PE vs. % Utilization of FPGA for Program Based design.

Figure 4.7 shows how
the size of the PE grows as the precision of the update
arithmetic is increased. The growth
is similar to that of the Problem Specific model.
Based on this data, 6 bits was selected for the precision of the
update calculations. The
6-bit precision design attains the maximum cell updates per second with 60 PEs. The maximum
clock frequency is 94.8 MHz; if the same design used 8 bits of precision, the maximum clock
frequency would be 88.1 MHz.
The precision selected for the update is also a trade-off between having smaller update units and having to perform the outer iteration more often. When the precision of the update calculation is larger, more inner iterations are performed before new residuals need to be calculated. Using a smaller precision has the advantage of being able to devote a larger number of smaller
units to calculating the update. The number of clock cycles
needed for each phase of the
Analysis Cycle Phase    Clock Cycles
Residual Calc           550
Shift Residual          330-990
Cell Update             190 * Inner Iterations
Add                     410-1135
Table 4.2: Clock cycles for different phases of the residual-update analysis cycle.
analysis cycle is shown in Table 4.2.
As the number of inner iterations increases during each analysis
cycle, the Program Based
model becomes more efficient. Figure 4.8 shows the increase in
the efficiency of the design
versus the number of inner update iterations for each residual
calculation. For this graph,
the average number of cycles for the Shifting and Adding phase
of the analysis cycle was
used. When the number of inner iterations is below 10, more than
half the time is spent calculating the residual or shifting the results. However, the
model rapidly becomes more
efficient. With 35 inner iterations per analysis cycle this
design achieves 75% efficiency, and
at 90 inner iterations the efficiency is 90%. The number of
iterations needed will depend on
the type of problem and the number of cells in the model.
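The efficiency curve in Figure 4.8 can be approximated directly from the cycle counts in Table 4.2. The sketch below uses the midpoints of the Shift and Add ranges, as the text does for the graph, and reproduces the quoted figures to within a couple of percent:

```python
# Cycle counts taken from Table 4.2; Shift and Add use range midpoints.
RESIDUAL_CALC = 550
SHIFT_RESIDUAL = (330 + 990) / 2       # 660
UPDATE_PER_ITERATION = 190
ADD = (410 + 1135) / 2                 # 772.5

def efficiency(inner_iterations):
    """Fraction of an analysis cycle spent calculating updates."""
    update_cycles = UPDATE_PER_ITERATION * inner_iterations
    total = RESIDUAL_CALC + SHIFT_RESIDUAL + update_cycles + ADD
    return update_cycles / total
```

With 35 inner iterations this model gives roughly 77% efficiency, and with 90 inner iterations roughly 90%, close to the values quoted above.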
The total cell updates per second that can be performed with the
whole chip is dependent
on the number of PEs on the FPGA. The maximum number of cell
updates per second for the
Program Based design is achieved slightly before the chip
resources are exhausted because
of routing congestion (see Figure 4.9). Figure 4.10 shows how
the maximum number of cell
updates per second rises and then falls as the number of PEs is increased.
The Program Based model depends on communication with a host
through the PCI bus
in the current design. Before the calculation can begin, the
coefficients for a specific design
need to be loaded into each of the PEs. Then, after the analysis
is complete, the results
need to be downloaded back to the host. The transfer times for
these operations are shown
[Figure: Efficiency (% of time spent calculating updates) vs. inner iterations per analysis cycle.]
Figure 4.8: Efficiency vs. number of inner iterations per analysis cycle.
[Figure: % chip utilization (based on Virtex II-4000) and maximum clock frequency (MHz) vs. number of PEs.]
Figure 4.9: % Utilization of FPGA and maximum clock frequency vs. number of PEs for Program Based design.
[Figure: Cell updates per second (millions) vs. number of PEs.]
Figure 4.10: Cell updates per second vs. number of PEs for Program Based design.
Time (ms)
Operation                1 PE    1 FPGA    DINI Board
Uploading Coefficients   2.10    114       228
Downloading Results      0.311   18.5      37
Host Computations        0.360   21.5      86.0
Table 4.3: Time for operations associated with analysis on the Program Based design.
in Table 4.3. This communication is synchronized by an external
clock supplied by the host.
There is additional overhead due to the computation of analysis
coefficients on the host for
each design.
The problem shown in Figure 4.2 was modeled on the Program Based design. This was the same model run on the Problem Specific design, except the force was scaled up so that the maximum result was closer to an 18-bit number. The results were again compared to a C++ simulation that used floating-point for all calculations. Figure 4.11 shows the results attained from the Program Based model were very close to the actual results. The mean of the percent error between the Program Based model data and the actual results was 0.099% for displacement and 0.118% for rotation.
The results for the Program Based model were much more accurate
than the results from
the Problem Specific model because the Program Based model uses
18 bits of precision
during the residual calculations. The speed of the Program Based
model comes from the use
of only 6 bits for the update calculations. The external force
was scaled so the maximum
displacement would be close to 18 bits. After they were
computed, the results were scaled
down to match the original problem.
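The pre- and post-scaling described above can be sketched as follows. The `solve` callable and the `expected_max` estimate are hypothetical stand-ins for the FPGA analysis and the problem's anticipated maximum response:

```python
MAX_18BIT = 2**17 - 1   # largest magnitude representable in signed 18 bits

def scaled_solve(force, solve, expected_max):
    # Scale the load so the largest result nearly fills the 18-bit range,
    # run the (linear) analysis, then undo the scaling on each result.
    s = MAX_18BIT / expected_max
    return [r / s for r in solve(force * s)]

# Toy linear "solver" standing in for the hardware (response = k * force).
results = scaled_solve(10.0, lambda f: [0.5 * f, 1.0 * f], expected_max=10.0)
# results approximately recovers [5.0, 10.0], the unscaled responses
```

Because the analysis is linear, the scaling cancels exactly; its only purpose is to keep intermediate values large relative to the fixed-point grid.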
[Figure: Displacement (mm) vs. Position (m); actual and Program Based displacement and rotation.]
Figure 4.11: Actual results and results from Program Based design for beam problem.
Maximum Cell Updates per Second (millions)
Design               1 PE    1 FPGA    DINI Board
Problem Specific     65.1    2279      9116
Program Based        18.9    1137      4548
Table 4.4: Maximum cell updates per second for both implementations.
Design               Maximum Cell Updates per Second (millions)    Speed-up
PC                   48.9                                          -
Problem Specific     9116                                          186.4
Program Based        4548                                          93.0
Table 4.5: Maximum cell updates per second and speed-up for both implementations compared to the PC.
4.4 Comparison of Designs
Most of the results presented in the earlier sections were for
systems using one FPGA.
However, the DINI board intended for this system has 4 FPGAs.
The total computing
power increases linearly because all the FPGAs can run in
parallel. Table 4.4 shows the
maximum number of cell updates per second for each design at each level of the system.
To compare these FPGA based designs to conventional methods, a
C++ program was
written to calculate the results. To make the comparison as fair
as possible, the program
uses integer arithmetic instead of slower floating-point
arithmetic. Integer computations are
closer to the fixed-point math used by the FPGA designs. The program was compiled using GCC with optimization enabled and executed on a PC with a 1.7 GHz processor and 1 GB of RAM running Debian Linux. Table 4.5 shows the speed-up attained by the FPGA designs over the general-purpose processor version.
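The speed-ups in Table 4.5 follow directly from the peak update rates; a minimal check:

```python
# Maximum cell updates per second in millions, from Table 4.5.
RATES = {"PC": 48.9, "Problem Specific": 9116, "Program Based": 4548}

# Speed-up of each design relative to the PC baseline.
speedups = {name: rate / RATES["PC"] for name, rate in RATES.items()}
# Problem Specific comes out near 186.4x and Program Based near 93.0x.
```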
-
Chapter 5
Conclusions
5.1 Summary
Cellular Automata (CA) theory has been studied for decades, with
most of the work done
on modeling natural systems. Recently, the use of CA theory has
been extended to provide a
system for structural analysis and design optimization. This work has proved to be successful, but the calculations are very slow on traditional general-purpose processors. The parallel nature of these CA models makes them a good candidate for implementation on a reconfigurable computer. The work presented in this thesis shows the
initial steps toward making
an automated tool to perform structural design optimization
accelerated by a reconfigurable
computer.
The contribution of this thesis was to design and implement two
models for the analysis
phase of the CA structural design optimization cycle. Both
designs take advantage of the
parallel nature of cellular automata by using a distributed
array of processing elements.
For the Problem Specific implementation, these processing
elements are customized to each
problem. The Program Based implementation has a more flexible
processing unit that
is controlled by a program designed to simulate a specific
cellular automata model. The
Program Based implementation also has the built-in capability to
use a residual-update
method to accelerate calculations and improve accuracy.
5.2 Results
The results show the Problem Specific design and the Program
Based design were able to
generate cell updates at the rate of 9.12 and 4.55 billion per
second, respectively. Though the
Problem Specific design proved to be able to generate updates
more rapidly, this increase
in speed came at the expense of precision and flexibility. The
Program Based model’s
competitive speed, improved accuracy, and ability to handle a
range of update rules make it
the architecture that provides the most potential for an
automated system.
Both hardware implementations of these CA models for structural analysis were very successful in terms of performance. When compared to a 1.7 GHz
Pentium 4 processor, the
Problem Specific design proved to be 186 times faster. The
Program Based design, which
was slightly slower, was still 93 times faster than the general
purpose processor version.
These speed-ups are a step towards making a CA system for
structural design optimization
that significantly outperforms traditional methods.
5.3 Future Work
There are a number of interesting areas that need to be studied
in order to design an
automated tool to perform structural design optimization using CA. The most immediate
may be the need for a translator and compiler for the programs
used by the Program Based
design. For the work in this thesis, the programs were all
written by hand. This process was
very difficult and time consuming. A compiler is needed to take
a higher level abstraction and
generate machine-level instructions. The end product should be a compiler that could take a problem specified by a design engineer who has no knowledge of the hardware implementation
and produce the necessary instructions.
Additionally, an efficient method for design calculations must
be implemented. In the
current system, the results are downloaded to a host computer
where design calculations can
be performed. However, this is an inefficient technique. A
number of possibilities exist for
executing the design calculations on the board, such as using partial reconfiguration or on-board processors. These possibilities need to be investigated to
identify the best method.
Another area that needs work is implementing multi-grid on the
system. The multi-grid
approach to these CA problems would be to calculate results
while varying the resolution of
the grids. In other words, the number of cells representing the
system would increase and
decrease based on certain algorithms. Multi-grid could also be
used to blend analysis and
design steps into a single cycle. These methods have the
potential for huge reductions in the
number of calculations needed.
-
Appendix A
This appendix gives an example of how the programs for the
Program Based model are
written.
The Processing Elements (PEs) in the Program Based model were
developed to perform
high and low precision arithmetic and convert between the two
forms. The control logic
needed to simulate a cellular automata model is complex, so
programs are used to set the
control signals. The program is stored in an internal BlockRAM contained in the Control Unit (CU). As the PEs were designed, the control signals for each unit were assigned to particular bits of the BlockRAM in the CU. The position of the bits and a short description of their function were recorded in an Excel spreadsheet. Figure A.1 shows a screenshot of this spreadsheet.
Each phase of the analysis cycle was written as a separate
program. There are control
signals for each functional unit of the PE, but only one
functional unit is in use during
each phase. The first step in writing a program was to identify
the signals needed for the
particular phase. The pertinent signals were placed across the
top of a spreadsheet and the
value of the signal was specified below. Each horizontal line represents a clock cycle step.
Figure A.2 shows an example of a program. This particular
program calculates the update
during the inner iteration of the analysis cycle.
The signals on the left are used to determine the order in which
instructions are executed.
Figure A.1: Spreadsheet with position of control signals and
short description.
The program counter will increment by one unless a loop is specified. The signals on the right control the PEs. Signals with only two options are specified as Y or N. Signals with more
Y or N. Signals with more
options are specified as a number or letter from a certain
set.
There is a second spreadsheet which determines the numerical
value for each signal. The
values of the signals are then converted into an intermediate
form. The intermediate form is
the numerical value of the signal multiplied by two to the power
of its bit position. The final
value is the sum of all the intermediate values (see Figure
A.3). This is the number that is
loaded into the CU BlockRAM. These final values are then put in
a form that can be read
into memory (see Figure A.4).
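The encoding step described above amounts to shifting each signal's value to its assigned bit position and summing; a minimal sketch with hypothetical signal names and bit positions:

```python
def encode_control_word(signals, bit_positions):
    # Each signal value is multiplied by two to the power of its assigned
    # bit position (a left shift), and the products are summed into one word.
    return sum(value << bit_positions[name] for name, value in signals.items())

# Hypothetical layout: a 1-bit enable at bit 0, a 2-bit mux select at bit 1.
positions = {"enable": 0, "mux_sel": 1}
word = encode_control_word({"enable": 1, "mux_sel": 2}, positions)  # 1*1 + 2*2 = 5
```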
Figure A.2: Spreadsheet containing the update program.
Figure A.3: Spreadsheet converting signals to the form used by the Program Based model.
Figure A.4: Spreadsheet containing the data values in a form
that can be loaded into memory.
-
Vita
Thomas Hartka was born in June 1980 in Baltimore, Maryland. He attended Archbishop High School in Severn, Maryland. Thomas enrolled in the
College of Engineering at
Virginia Tech in the fall of 1998. He graduated Cum Laude with a
Bachelor of Science in
Computer Engineering. Thomas chose to remain at Virginia Tech to pursue his Master's Degree. He became involved in research at the Virginia Tech Configurable Computing Lab. After
graduating, Thomas will attend Johns Hopkins’ Post-Baccalaureate
Premedical Program.