Stream Computations Organized for Reconfigurable Execution (SCORE): Introduction and Tutorial

Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, Yury Markovskiy, André DeHon, John Wawrzynek
U.C. Berkeley BRASS group
August 25, 2000 / Version 1.0
Abstract
A primary impediment to wide-spread exploitation of reconfigurable computing is the lack of a unifying computational model which allows application portability and longevity without sacrificing a substantial fraction of the raw capabilities. We introduce SCORE (Stream Computation Organized for Reconfigurable Execution), a stream-based compute model which virtualizes reconfigurable computing resources (compute, storage, and communication) by dividing a computation up into fixed-size "pages" and time-multiplexing the virtual pages on available physical hardware. Consequently, SCORE applications can scale up or down automatically to exploit a wide range of hardware sizes. We hypothesize that the SCORE model will ease development and deployment of reconfigurable applications and expand the range of applications which can benefit from reconfigurable execution. Further, we believe that a well engineered SCORE implementation can be efficient, wasting little of the capabilities of the raw hardware. In this paper, we introduce the key components of the SCORE system.
1 Introduction
A large body of evidence exists documenting the raw advantages of reconfigurable hardware such as FPGAs over conventional microprocessor-based systems on selected applications. Yet reconfigurable computing remains in limited use, popular primarily in application-specific domains (e.g. [23] [32] [38]) or as a replacement for ASICs for rapid prototyping and fast time-to-market. This limited popularity is not due to any lack of raw hardware capability, as million-gate devices are readily available [37] [2], and we have seen recent advances in high clock rates [30] [28], rapid reconfiguration [29] [16] [14], and high-bandwidth memory access [24] [19] [25]. Rather, we believe that the limited applicability of reconfigurable technology derives largely from the lack of any unifying compute model to abstract away the fixed resource limits of devices which, otherwise, restrict software expressibility as well as longevity across device generations.

A significantly reduced extended abstract of this article appears in FPL'2000 and is copyright Springer-Verlag. This expanded version can be found at
Existing targets are non-portable. Software for reconfigurable hardware is typically tied to a particular device (or set of devices), with limited source compatibility and no binary compatibility even across a vendor-specific family of devices. Redeploying a program to bigger, next-generation devices, or alternatively to a smaller, cheaper, or lower-power device, typically requires substantial human effort. At best, it requires a potentially expensive pass through mapping tools. At worst, it requires a significant rewrite to fully exploit new device features and sizes. In contrast, a program written for microprocessor systems can automatically run and benefit from additional resources on any ISA-compatible device, without recompilation.
Existing targets expose fixed resource limitations. The exposure of fixed resource limitations in existing programming models tends to impair their expressiveness and broad applicability. In such programming models, an application's choice of algorithm and spatial structure is restricted by the size of available hardware. Furthermore, a computation's structure and size must be fixed at compile time, with no allowance for dynamic resource allocation. Hence algorithms with data-dependent structures or potentially unbounded resource usage cannot be easily mapped to reconfigurable hardware.1
Virtualize resources. The SCORE compute model introduced in this paper addresses the issue of fixed resource limits by virtualizing the computational, communication, and memory resources of reconfigurable hardware. FPGA configurations are partitioned into fixed-size, communicating pages which, in analogy to virtual memory, are "paged in" or loaded into hardware on demand. Streaming communication between pages which are not simultaneously in hardware may be transparently buffered through memory. This scheme allows a partitioned program to run on arbitrarily many physical pages and to automatically exploit more available physical pages without recompilation. With proper hardware design, this scheme permits binary compatibility and scalability across an architectural family of page-compatible devices.
Convenient and Efficient Model. For software to benefit from additional physical resources (pages), the programming model should expose (page-level) parallelism and permit spatial scaling. SCORE's programming model is a natural abstraction of the communication which occurs between spatial hardware blocks. That is, the data flow communication graph captures the blocks of computation (operators) and the communication (streams) between them. Once captured, we can exploit a wealth of well-known techniques for efficiently mapping these computational graphs to arbitrary-sized hardware. Furthermore, run-time composition of graphs is supported, enabling data-driven program structure, dynamic resource allocation, and the integration of separately compiled or developed library components.
Section 2 of this paper discusses other systems and compute models which have influenced the formulation of SCORE. Section 3 presents the key components of the SCORE model. Section 4 discusses the hardware requirements for a SCORE implementation and why they are reasonable in today's technology. Section 5 gives a brief introduction to programming constructs for SCORE. Section 6 shows an execution sample, and Section 7 describes the basic architecture for our current implementation of the SCORE run-time system. Section 8 shows results from a JPEG encoder in SCORE as an example of our early experience implementing a SCORE system.
1 Data-dependent computational structures can be constructed via specialization and recompilation, as in [38], but this requires a complete pass through mapping tools.
2 Related Work
The technique of time-multiplexing a large spatial design onto a small reconfigurable system was demonstrated by Villasenor et al. [31]. By hand-partitioning a particular design (motion-wavelet video coder) into a graph of FPGA-sized "pages" and manually reconfiguring each device with those pages, they were able to run the design on one third as many devices (i.e. physical pages) as were originally required, with only 10% performance overhead. The key to this approach's efficiency was to amortize the cost of reconfiguration by having each page process a sizable stream of data (buffered through memory) before reconfiguring. SCORE aims to automate the partitioning and efficient dynamic reconfiguration performed manually by Villasenor.
The ease and success of such automation depends on appropriate models for program description and dynamic reconfiguration. In this regard, SCORE builds on prior art developing ISA, data flow, distributed, and streaming computation models. In the remainder of this section, we discuss the relation of SCORE to these prior models.
ISA Models

The first attempts to define a "compute model" for reconfigurable computing devices were focused on augmenting a traditional processor ISA with reconfigurable instructions. PRISC [27] (and later Chimæra [15]) allowed the definition of single-cycle, Programmable Function Unit (PFU) operations, using a TLB-like management and replacement scheme to virtualize the space of PFU instructions, exploiting local, dynamic reuse of PFU instructions. The size of the PFUOP itself, however, was fixed by the architecture, and PFUOPs are constrained by the sequential ISA to execute sequentially. Hence, the model does not directly allow the architecture to scale and exploit additional parallel hardware.
DISC [36] and GARP [16] expand the PRISC model to allow variable-size and multiple-cycle array configurations. These architectures can pack multiple configurations (instructions) into the available array and, in the case of GARP, support an implementation-dependent number of cached array configurations. However, like PRISC, each array configuration must be smaller than the available physical logic, and reconfigurable instructions can only be composed sequentially in the ISA. Consequently, these architectures also prevent one from scaling array size and automatically exploiting the additional parallel hardware.
OneChip [19] expands the ISA extension model further by allowing scoreboarded operations from memory to memory in the ISA. While still based on a sequential ISA computing model, this potentially facilitates the use of multiple, parallel RFUs. As long as each RFUOP operates on independent memory banks, the RFU operations will not interlock and may proceed in parallel. This technique, however, exposes the memory buffers between pipelined and chainable operations. It forces the user or compiler to pick blocking factors and to schedule blocked operations in parallel in the processor's instruction dispatch window. In fact, this approach prevents direct pipeline assembly of chained operations. Furthermore, the ISA forces the compiler to schedule the invocation of RFU operations, limiting the opportunity to schedule components of the reconfigurable computation in a data-driven manner.
Dynamic Reconfiguration

Ling and Amano [22] describe the Multi-Processor WASMII, a scalable FPGA-based architecture which partitions and time-multiplexes large applications as FPGA-sized pages, much like SCORE. The primary limitation to WASMII's performance is that page communication is buffered through a small, fixed set of device registers (the "token router"). With such a small communication buffer, a page can operate for only a short time before depleting available inputs or output space and, if the page is time-multiplexed, triggering reconfiguration. Hence, when running a design which is larger than available hardware, execution time may be dominated by reconfiguration time. Brebner [5] [6] proposes a similar demand-paged, reconfigurable system based on arbitrary-sized swappable logic units (SLUs) which communicate through periphery registers2 and are subject to the same inefficiency as WASMII when time-multiplexed. SCORE avoids this inefficiency by allowing large (unbounded) communication buffers, enabling longer page execution between reconfigurations.
CMU's PipeRench [14] defines a reconfigurable fabric paged into horizontal stripes which communicate vertically as a pipeline. The execution model fully virtualizes stripes and enables hardware scaling to any number of physical stripes. Although stripes communicate through input-output registers as in WASMII, PipeRench's stripe-sequential, pipelined reconfiguration scheme hides the excessive reconfiguration overhead seen in WASMII. This sequential reconfiguration scheme is well suited to simple, feed-forward pipelines. However, this scheme does not support computation graphs with feedback loops, and it may waste available parallelism when squeezing wide graphs into a linear sequence of stripes. In particular, when virtualizing a computation with more parallelism than is available in a single architected stripe, non-communicating stripes which simultaneously fit into hardware must still load in sequence, incurring added latency and an area cost for buffering stripe I/O. SCORE makes no such restrictions on execution order, allowing parallel reconfiguration of physical pages.

2 In Brebner's "parallel harness" model, SLUs are arranged in a mesh and communicate with nearest neighbors via periphery registers. In the data-parallel "sea of accelerators" model, SLUs do not communicate with each other and so would not incur the same virtualization overhead discussed above.
Data Flow

The original Dennis formulation of data flow [11] [10] described a processor ISA which represented data flow graphs directly, each instruction being an operator. The execution model included only a single result register per instruction, allowing an instruction to execute only once at a time before its successors must execute. While this restriction on instruction ordering is reasonable for a microprocessor where large instruction store and fast instruction issue are available, it is not reasonable for a reconfigurable device where reconfiguring on each instruction issue is too costly. Iannucci's hybrid data flow [18] and Berkeley's TAM [9] define operators by straight-line blocks of instructions, relaxing the frequency of inter-instruction synchronization to only the entry and exit points of blocks. Nevertheless, these models inherit the same problem of fixed communication buffers as Dennis data flow and thus face the same inefficiency as WASMII in a time-multiplexed reconfigurable implementation.
Streaming formulations of data flow remove the limitation of fixed input-output buffers, allowing arbitrarily many tokens to queue up along an arc of a data flow graph. This generalization allows a time-multiplexed implementation to fire an operator many times in succession before reconfiguring, amortizing the cost of reconfiguration over a large data set. Lee's synchronous data flow (SDF) [21] [3] incorporates streaming for the restricted case of static flow rates. Although this model of computation is not Turing-complete (it lacks data-dependent control flow), it guarantees that conforming graphs can be statically scheduled to run with bounded stream buffers.
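The static-schedulability property of SDF can be illustrated with its balance equations: for each arc, the tokens produced per schedule period must equal the tokens consumed. The following sketch (our own illustration, not from the paper) solves for the smallest integer repetition vector of a three-actor chain; the graph, rates, and function names are assumptions for the example.

```python
# Illustration of SDF balance equations: actor A produces 2 tokens per
# firing onto an arc that B consumes 3 per firing; B produces 1 per
# firing onto an arc that C consumes 2 per firing.
# Balance: 2*q[A] = 3*q[B] and 1*q[B] = 2*q[C].
from fractions import Fraction
from math import lcm

def repetition_vector(edges, actors):
    """Solve q[src]*prod == q[dst]*cons for all edges; return the
    smallest integer solution (the repetition vector)."""
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, prod, cons in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
                changed = True
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}

edges = [("A", "B", 2, 3), ("B", "C", 1, 2)]
print(repetition_vector(edges, ["A", "B", "C"]))  # {'A': 3, 'B': 2, 'C': 1}
```

Firing A three times, B twice, and C once returns every arc to its initial token count, so the schedule can repeat forever with bounded buffers.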
Buck's integer-controlled data flow (IDF) [7] incorporates data-dependent control flow by adding to SDF a set of canonical dynamic-rate operators (e.g. switch, select). SCORE permits a dynamic-rate model, allowing data-dependent control flow inside any operator. As such, SCORE programs are essentially equivalent to IDF in expressiveness, since a SCORE operator is equivalent to an IDF graph containing dynamic operators.
SCORE shares a gross similarity to heterogeneous systems which use streaming data flow to tie together arbitrary processors (conventional, special-purpose, and/or reconfigurable), including MIT's Cheops [4], MagicEight [35] [34], and Berkeley's Pleiades [26] [1]. The programming model of these systems is more restricted than SCORE's, typically based on a pre-defined set of streaming operations. Furthermore, SCORE provides a stronger abstract model, allowing pages (processors) to be swapped as needed and hiding implementation limitations like buffer sizes.
Streaming APIs

Virginia Tech [20] defines a streaming API for processor-controlled networks of reconfigurable devices (e.g. SLAAC [8]). While standardizing the form in which applications may be written, this API does not, in and of itself, virtualize the size or fabric of compute resources and hence does not allow the definition of portable and scalable designs. Rather, it serves only as a hardware interface layer to manually reconfigure devices on the network.
Maya Gokhale [13] defines a C-based streaming programming model, Streams-C. Like SCORE, Streams-C exploits the fact that reconfigurable hardware is efficiently organized as a collection of spatial pipelines and that streams provide a natural abstraction for the hardware linkage between two separate design components. Nevertheless, Streams-C only serves as a convenient way to compactly describe spatial designs. No virtualization is performed, and the burden of handling placement and fixed buffer size restrictions is placed entirely on the programmer. In these regards, SCORE attempts to provide a much higher level programming model, providing semantics which are decoupled from hardware artifacts, like buffer sizes and physical hardware size, and automatically filling in these lower level details at compile and run time.
CSP

In many ways the SCORE computational model is similar to Hoare's Communicating Sequential Processes (CSP) [17]. Each SCORE operator can be viewed as a single process. These operators communicate with each other via designated stream connections, somewhat like CSP's named ports. Unlike CSP ports, SCORE streams are buffered and offer an unbounded stream abstraction. Significantly, SCORE operators, unlike CSP processes, are not allowed to be non-deterministic. Composition of SCORE operators always yields deterministic, observable results. In fairness, most of CSP's non-determinism is to facilitate modeling of unpredictable, dynamic effects in real systems, and most of SCORE could be modeled on top of CSP. SCORE also allows dynamic construction of computational graphs, which was not in the original CSP formulation but could of course be added.
3 SCORE Computational Models
A compute model defines the computational semantics that a developer expects the physical machine to provide. The compute model itself is abstract but captures the essence of how computation proceeds, defining the meaning of any computation. The compute model is given a more concrete embodiment in one or more programming models. The programming model provides a high-level view of application composition and execution, adding a number of practical conveniences for the programmer. Ultimately, both models are grounded in an execution model which defines the way the computation is actually described to the physical hardware and the meaning associated with any such description.
The execution model, programming model, and abstract computational model are all consistent views of computation. What differs among them is the level of detail which they expose or abstract (see Figures 1 and 2). The execution model abstracts the number of key resources (e.g. ALUs, pages) to allow scaling across different hardware platforms. The programming model abstracts architectural characteristics found in the execution model (e.g. ISA details, limited resource sizes exposed at the architectural level). The compute model abstracts away the concrete syntax and primitives provided by a particular programming language or system.
3.1 Compute Model
A SCORE computation is a graph of computation nodes (operators) and memory blocks linked together by streams. Streams provide node-to-node communication and are simply single-source, single-sink FIFO queues with unbounded length. Graph nodes (operators) are of two forms: (1) Finite-State Machine (FSM) nodes, which interact with the rest of the graph only through their stream links; and (2) Turing-complete (TM) nodes, which support resource allocation in addition to stream operations.
SCORE FSMs have the property that the present state identifies a set of inputs to be read from the input streams. Once a full set of inputs is present, the FSM consumes the inputs from the appropriate set of input FIFOs and may conditionally emit outputs or close input or output streams. As with any standard FSM, SCORE FSMs transition to a new state based on their inputs and present state. Each SCORE FSM has a distinguished done state which it may enter to signal its completion and to remove itself from the running computation.
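The FSM firing discipline described above can be sketched in a few lines. This is our own illustrative model, not SCORE's TDF notation: the class name, states, and scheduler loop are assumptions made for the example. The key property it demonstrates is that the current state names which input streams must hold a token before the operator may fire.

```python
# Sketch of a SCORE-style FSM operator: the present state identifies
# the inputs that must be present; firing consumes those tokens, may
# emit outputs, and selects the next state. A distinguished DONE state
# removes the operator from the running computation.
from collections import deque

class InterleaveFSM:
    """Alternately copies one token from stream a, then one from b."""
    def __init__(self, a, b, out):
        self.a, self.b, self.out = a, b, out
        self.state = "READ_A"

    def ready(self):
        # The present state identifies the required input stream.
        need = {"READ_A": self.a, "READ_B": self.b}.get(self.state)
        return need is not None and len(need) > 0

    def fire(self):
        if self.state == "READ_A":
            self.out.append(self.a.popleft())       # consume and emit
            self.state = "READ_B" if self.b else "DONE"
        elif self.state == "READ_B":
            self.out.append(self.b.popleft())
            self.state = "READ_A" if self.a else "DONE"

a, b, out = deque([1, 3]), deque([2, 4]), deque()
fsm = InterleaveFSM(a, b, out)
while fsm.ready():           # a run-time scheduler would drive this
    fsm.fire()
print(list(out))  # [1, 2, 3, 4]
```

Because the operator only touches its streams, its observable behavior is independent of when the scheduler chooses to fire it.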
A SCORE TM node is similar to a SCORE FSM node but adds the ability to allocate memory and to create new graph nodes (FSM or TM operators) and edges (streams) in the SCORE compute graph.
Memory is allocated in finite-sized blocks called segments. Each segment may be owned by a single operator at a time. A SCORE TM may allocate new segments and pass them on to an FSM or TM node that it creates. Upon termination, when a TM or FSM node enters the done state,
[Figures 1 and 2: the three levels of abstraction and example instances of each.]

Compute Model: abstract model capturing essential semantics of computation.
Programming Model: particular set of programming constructs providing a convenient way to express computations in the compute model. The programming model is abstracted from certain details that arise in the execution model, like architectural page size or number of registers.
Execution Model: low-level (executable) description of the computation and the semantics which the hardware is expected to provide when interpreting this description. The execution model is abstracted from certain hardware size details like number of resources.

Examples (compute model | programming model | execution model):
  sequential execution | C + Unix   | MIPS-ISA + single global memory + Unix-ABI
  sequential execution | C + WinAPI | x86-ISA + WinABI
  SCORE                | C++ + TDF  | MIPS-ISA + ScoreRT + linux + linux-ABI + SCORE; 256 4-LUT CPs + SCORE
it returns ownership of any received segments back to the operator that created it. If an operator attempts to access a memory segment that it does not presently own, that access is blocked (i.e. the operator stalls) until the operator regains ownership of the memory segment.
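The single-owner discipline can be sketched as follows. This is a hypothetical model of the rule, not the SCORE run-time API: the class, the owner names, and the use of an exception to stand in for a hardware stall are all assumptions for illustration.

```python
# Sketch of the single-owner memory-segment discipline: a segment is
# owned by exactly one operator at a time; an access by a non-owner
# blocks (modeled here by raising BlockingIOError to stand in for a
# hardware stall) until ownership is transferred back.
class Segment:
    def __init__(self, size, owner):
        self.data = [0] * size
        self.owner = owner

    def read(self, who, addr):
        if who is not self.owner:
            # In hardware the operator would stall; we model the stall
            # as an exception for the sake of the example.
            raise BlockingIOError(f"{who} stalls: does not own segment")
        return self.data[addr]

    def transfer(self, new_owner):
        # e.g. a parent TM passing a segment to a child node it creates
        self.owner = new_owner

seg = Segment(16, owner="parent_tm")
seg.transfer("child_fsm")
print(seg.read("child_fsm", 0))   # 0: the owner may access freely
try:
    seg.read("parent_tm", 0)      # the parent no longer owns it
except BlockingIOError:
    print("parent stalls")
```

Since at most one operator can touch a segment at a time, there is no read/write-ordering hazard, which is exactly what the determinism argument below relies on.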
The operational semantics of the SCORE compute model are fully deterministic. This follows from the determinism of individual operators, the timing-independent communication discipline, and the fact that operators cannot side-effect each other's state. In particular, (1) operators communicate with each other only through streams, whose token flow semantics guarantee a timing-independent order of execution; (2) memory segments have a single, unique owner at any time and thus do not suffer from multiple-accessor, read/write-ordering hazards. Thus, the observable results of a SCORE computation are completely independent of the timing of any operator or the delay along any stream.
Appendix B defines the compute model more precisely.
3.2 Programming Model
A programming model gives the programmer a framework for describing a computation in a manner independent of device limits, along with guidelines for efficient execution on any hardware implementation. It can be more abstract than the execution model because the compiler will take care of translating the higher level description provided by the programmer into the details needed for execution. The key abstractions of the SCORE programming model are operators, streams, and memory segments.
3.2.1 Basic Components

Operators

An operator represents a particular algorithmic transformation of input data to produce output data. Operators are the computational building blocks for a computation (e.g. multiplier, FIR filter, FFT). Operators may be behavioral primitives or hierarchical graph compositions of other operators. Figure 3 shows an example video processing operator composed as a pipeline of transformations, including a motion estimation operator, an image transformation operator, a data quantization operator, and a coding operator. The size of an operator in hardware is implementation dependent and is in no way limited in the programming model. Operators may need to be partitioned to fit onto an architectural compute page. Partitioning is an integral, automated part of the compilation process.
Streams

Inter-operator communication uses a streaming data flow discipline. When the programmer needs to connect operators together, he links the producer to the consumer operator using a stream link. The link both serves to define where data is logically routed and acts as an unbounded-length queue for data tokens. Operators signal both when they are producing data and when they need to consume data. This signalling translates into data presence signals on the stream links which synchronize all communication between operators.
Memory Segments

A memory segment is a contiguous block of memory and serves as the basic unit for memory management. Memory segments may be any size, up to an architecturally defined maximum. A memory segment may be used in a SCORE computation by giving it a specific operating mode (e.g. sequential read, random-access read-write, FIFO) with an appropriate stream interface, then linking it into a data flow graph like any other operator (see Figure 5).
3.2.2 Dynamic Features
On top of these basic components, SCORE supports a number of important dynamic features:
• Dynamic rate operators
• Dynamic graph composition and instantiation
• Dynamic handling of uncommon events
Dynamic rate operators

An operator may consume and produce tokens at data-dependent rates. This expressive power allows SCORE to describe efficient operators for tasks such as data compression, decompression, and searching or filtering. Section 5.2 shows a possible set of linguistic constructs for supporting dynamic rate consumption and production. To exploit dynamic rates, scheduling decisions should be made at run time, when the dynamic rates and actual data availability are known.
Dynamic composition and instantiation

SCORE allows run-time instantiation of operators and data flow graphs. That is, the computational graph may be created, extended, or modified during execution. Extending the graph means creating new graph nodes and edges which may be defined in a data-dependent manner. An operating node may terminate during execution, and existing stream links may be shut down by their attached operators.
This mechanism has several benefits over describing a computation strictly by a static graph at compile time. It gives the programmer an opportunity to postpone or avoid allocating resources for parts of the computation which are not used immediately or whose resource requirements cannot be bound until run time. It also enables the creation of data-dependent computational structures, for instance, to exploit dynamically-unrolled parallelism. Finally, it creates a framework in which aggressive implementations may dynamically specialize operators around instantiation parameters. That is, an operator may have parameters bound at instantiation time, i.e. when the operator is composed into a data flow graph. This mechanism allows operators to be initialized with unchanging or slowly changing scalar data or to be specialized around parameter values. Examples in Section 5.2 show one set of linguistic constructs to support composition and instantiation.

Figure 3: Video Processing Operator (a pipeline of Motion Estimation, Transformation, Quantization, and Coding stages)
Exception handling

Exception handling falls naturally out of the data flow discipline of SCORE. When an unusual condition occurs, the operator may raise an exception. At this point, the operator stops rather than producing output data. Dependent, downstream operators may have to stall waiting for this operator to resume and produce an output, but the data flow discipline guarantees that they wait properly for the operator to handle the exception and produce a result. When the exception is handled, the raising operator resumes operation, producing data and allowing the downstream operators to resume in turn.
3.3 Execution Model
The key idea of a computer architecture is that it defines the computational description that a machine will run and the semantics for running it (e.g. the x86 ISA is a popular architectural definition for processors). Someone building a conforming device is then free to implement any detailed computer organization that reads and executes this computational description (e.g. i80286, i80386, i80486, Pentium, and K6 are all different implementations that run the same x86 computational description). Following this technique, the execution model for SCORE defines the run-time computational description for an architecture family and the semantics for executing this description.
The SCORE execution model defines all computation in terms of three key components:
• A compute page (CP) is a fixed-size block of reconfigurable logic which is the basic unit of virtualization and scheduling.
• A memory segment is a contiguous block of memory which is the basic unit for data page management.
• A stream link is a logical connection between the output of one page (CP, segment, processor, or I/O) and the input of another page. Stream implementations will be physically bounded, but the execution model provides a logically unbounded stream abstraction.
A computational description in this execution model is independent of the size of the reconfigurable array, admitting architectural implementations with anywhere from one to a large number of compute pages and memories. The model provides the semantics of an unlimited number of independently operating physical compute pages and memory segments. Compute pages and segments operate on stream data tagged with input presence and produce output data to streams in a similar manner. The use of data presence tags provides an operational semantics that is independent of the timing of any particular SCORE-compatible computing platform.
Fixed Compute-Page Sizes

Compute pages are the basic unit of virtualization, scheduling, reconfiguration, and relocation. In analogy with a virtual memory page, a compute page is the minimum unit of hardware which is mapped onto physical hardware and is managed as an atomic entity. Each compute page represents a fixed-size piece of reconfigurable hardware (e.g. 64 4-LUTs). Compute pages differ from the operators of the compute model in that pages have architecturally imposed resource limitations such as size and maximum number of streams.
The decomposition of a computation into compute pages takes the stand that it is neither feasible nor desirable to manage every primitive computational building block (e.g. 4-LUT) as an independent entity, just as it is generally not desirable to manage every bit of memory as an independent block. Rather, by grouping together a larger block of resources, management and overhead can be amortized over the larger number of computational blocks. This grouping also allows hard problems, like placement and routing, to be performed offline within each page. Note that it is necessary that the page size be fixed across an architecture family so that all family members can run from the same run-time (binary) description. Otherwise, page (re-)packing, placement, and routing would need to be performed online. The fixed page discipline requires that compilers partition (or pack) more abstract computational operators into these fixed-size pages. Figure 4 shows an example decomposition of an operator graph into pages.

Figure 4: Example of Page Decomposition: (a) original operator as seen in programming model, (b) mapped to logic elements (LEs) in target architecture, (c) partitioning into 64-LE pages to match execution model page size, (d) final graph used by execution model
Compute pages may contain internal state which must be saved and restored when the page is swapped onto or off of a physical compute page. Swapping may be necessary in a time-multiplexed implementation and is key to supporting the semantics of an unbounded number of compute pages.
Memory Segments and Configurable Memory Blocks

A memory segment is a contiguous block of memory which is managed as a single, atomic memory block for the purposes of swapping and relocation. A memory segment may be used in one of several modes (e.g. sequential read, random-access read-write, FIFO). When configured into a particular mode, a segment has logical stream ports to connect it to the graph of pages (e.g. streams for random access: address input, data input, data output, control input). Figure 5 shows an example graph connecting pages and segments.

Figure 5: Data Flow Computation Graph with both Compute Pages and Segments (segments Segment0 and Segment1 linked by streams to virtual compute pages VCP0, VCP1, and VCP2)
To use a memory segment, the run-time system will mapit into a
configurable memory block(CMB). The CMB isa physical memory block
inside the reconfigurable array(See, for example, Figure 11) with
active stream links andinterconnect to connect the memory segment
into the ac-tive computation. In addition to holding user-specified
seg-ments, CMBs are also used to hold segments containingCP
configurations, segments containing CP state, and seg-ments
associated with stream buffers. A single CMB mayhold any number of
each of these types of segments as longas their aggregate memory
requirement does not exceed theCMB’s capacity (see Figure 6 for a
sample memory lay-out in a CMB). In our current vision, only a
single suchsegment may actually be active in each CMB at any
pointin time, but there is nothing in the SCORE definition
thatprevents an implementation from being designed to
handlemultiple, active segments in the same CMB.
Physically Finite, Logically Unbounded Streams. Streams form the data flow links between pages. A page (CP or segment) indicates when it is producing a valid data output with an out-of-band data present bit. The valid data value with its associated presence bit is termed a token. The token is transported to the destination input of the consuming operator. The stream delivers all data items generated by the producer, in order, to the consumer, storing each until the consumer indicates it has consumed it from the head of its input queue (see Figure 7). The data presence tag in a token serves a role similar to a stall signal in a conventional virtual memory or cache architecture; that is, it lets the processing unit know whether data is available so it can continue processing, or whether it must wait for data to arrive.

When a stream is empty, the downstream operator will stall waiting for more input data. This discipline hides the detailed timing of operations from the programming model, guaranteeing correct behavior while allowing variations between implementations of the computing architecture.
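The token and firing discipline above can be made concrete with a minimal software model. This is a sketch under our own naming (`Token`, `Stream`, `fire_add`); in SCORE hardware the presence check is an out-of-band signal, not a software test.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

// A token pairs a data value with an out-of-band presence bit.
struct Token {
    uint32_t data;
    bool present;   // "data present" bit
};

using Stream = std::deque<Token>;   // the stream's input queue

// Fire a hypothetical two-input operator once, if and only if both inputs
// have a present token at their heads; otherwise the operator stalls
// (returns no result and consumes nothing).
std::optional<uint32_t> fire_add(Stream& a, Stream& b) {
    if (a.empty() || !a.front().present) return std::nullopt;  // stall on a
    if (b.empty() || !b.front().present) return std::nullopt;  // stall on b
    uint32_t result = a.front().data + b.front().data;
    a.pop_front();    // tokens are consumed from the heads of the queues
    b.pop_front();
    return result;
}
```

The stall path is the point of the sketch: the operator's timing is entirely governed by data presence, never by a fixed cycle count.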
Even at the run-time level, streams provide the abstraction of unbounded-capacity links between producers and consumers.3 In practice, however, the streams are finite, with an implementation-dependent buffer capacity. To implement the semantics of unbounded, FIFO stream links, an implementation will use backpressure (see Figure 7) to stall production of data items, and the run-time system will allocate additional buffer space in FIFO segments as needed (see Figure 8 for an example of stream buffer expansion).

3See Appendix A for a discussion of why unbounded buffers are necessary.
Physically, a virtual stream may be realized in one of two ways:

• When both the producer and the consumer of a virtual stream are loaded on the physical hardware, the stream link can be implemented as a spatial connection through the inter-page routing network between the two pages.4 (See Figure 9.)

• When one of the ends of the stream is not resident, the stream data can be sunk to (or sourced from) a stream buffer segment active in some CMB on the component. (See Figure 10.)

This allows efficient, pipelined chaining of connected operators when space permits, as well as deep, intermediate data buffering when a computation must be sequentialized.
Hardware Virtualization. Compute pages, segments, and streams are the fundamental units for allocation, virtualization, and management of the hardware resources. At run time, an operating system manager schedules virtual pages and streams onto the available physical resources, including page assignment and migration and inter-page routing.

If there are enough physical resources, every page of a computation graph may be simultaneously loaded on the reconfigurable hardware, enabling maximum-speed, fully-spatial computation. Figure 9 shows this case for the video processing operator of Figure 3. If hardware resources are limited, a computation graph will be time-multiplexed onto the hardware. Streams between virtual pages that are not simultaneously loaded will be transparently buffered through on-chip memory. Figure 10 shows this case for the video processing operator. Each component operator is loaded into hardware in sequence, taking its input from one memory buffer and producing its output to another.
3.4 Model Implications
3.4.1 Advice for Programmers
One goal of the compute model is, at a high level, to focus the developer on the style of computation which is efficient for the hardware and execution model. To better utilize scalable reconfigurable hardware, SCORE developers should:

• Describe computations as spatial pipelines with multiple, independent computational paths. A hardware
4An implementation could choose to implement this link as a statically configured path as in FPGAs, a time-switched path, or even a dynamically routed path.
Figure 6: Segments and Other Data mapped onto a CMB (graphic omitted; it shows a CMB memory holding a configuration segment, two user segments, a CP state segment, a FIFO segment, and unused space, with the active segment selected by base, bound, and mode registers and connected through address, data-in, data-out, and control stream ports)
Figure 7: Stream Signals (graphic omitted; it shows the data, data-present, and backpressure signals across pipelined network stages between a source page and a sink page's input queue, with the full input queue asserting backpressure)
(The figure depicts the following sequence between compute pages A and B:)

1. Op A sends data to Op B.
2. Op B becomes full.
3. Backpressure from B stops Op A sending data.
4. No data being sent.
5. Runtime allocates new segment to buffer stream.
6. Runtime disassembles direct connection A→B.
7. Runtime creates new links A→segment and segment→B.
8. Op A sends data to stream buffer segment.
9. When Op B can take data, it takes it from the segment.
10. Segment serves as large buffer between A and B.
11. Runtime may decide to replace Op A.
12. Op B may continue taking data from the stream buffer.

Figure 8: Expansion of Finite Stream Buffer to provide Unbounded Stream Buffer Semantics
Figure 9: Fully Spatial Implementation of Video Processing Operator on Abstract SCORE Hardware (graphic omitted; Transform, Quantize, MotionEstimation, and Code are all simultaneously resident on compute pages, with a CMB serving as a buffer)
Figure 10: Capacity-Limited, Temporal Implementation of Video Processing Operator (graphic omitted; Transform, Quantize, MotionEstimation, and Code are swapped onto the hardware in sequence, with CMB buffers between the stages)
implementation will attempt to concurrently execute as many of the specified, parallel paths as possible.
• Avoid or minimize feedback cycles. Cyclic dependencies introduce delays which cannot be pipelined away and hence increase the total run time or lead to page thrashing in small hardware implementations.

• Expose large data streams to SCORE operators. Large data sets help amortize the overhead of loading computation into reconfigurable hardware, especially into small, time-multiplexed hardware implementations.
3.4.2 Generality
While we have described the SCORE hardware model here in terms of a single processor and homogeneous computational pages and memories, the model itself admits a number of extensions. SCORE can accommodate heterogeneous and specialized computational pages, as seen in Pleiades [26] and Cheops [4]. Using specialized pages most efficiently makes the scheduling problem more interesting, since some operators may run on multiple kinds of specialized pages. Also, nothing prohibits SCORE from using multiple conventional processors for executing sequential operators and/or the run-time scheduler. Conventional techniques for multiprocessing and distributed scheduling would be relevant in this case.
4 Hardware Requirements
SCORE assumes a combination of a sequential processor and a reconfigurable device. The reconfigurable array must be divided into a number of equivalent and independent compute pages.5 Multiple, distributed memory blocks are required to store intermediate data, page state, and page configurations.
The interconnect among pages is critical to achieving high performance and supporting run-time page placement. It should support high-bandwidth, low-latency communication among compute pages and memory, allowing memory pages to be used concurrently. The interconnect must buffer and pipeline data as well as provide backpressure signals to stall upstream computation when network buffer capacity is exceeded. Routing resources should be sufficiently rich to facilitate rapid, online routing.
The compute pages themselves may use any reconfigurable fabric that supports rapid reconfiguration, with provision to save and restore array state quickly. The BRASS HSRA subarray design [30] is a feasible, concrete implementation for a compute page. It provides microsecond reconfiguration and high-speed, pipelined computation.
Each configurable memory block (CMB) is a self-contained unit with its own stream-based memory port and an address generator (see Figure 6). CMBs may be accessed independently and concurrently in a scalable system. The memory fabric may use external RAM or on-chip memory banks (e.g. BRASS embedded DRAM [25]), with additional logic to tie into the data flow synchronization used by the interconnect network. The memory controllers need to support a simple, paged segment model including address relocation within a memory block and segment bounds. Streaming data support obviates the need for external addressing during reconfiguration and stream buffering.
The sequential processor plays an important part in the SCORE system. It runs the page scheduler needed to virtualize computation on the array, and it executes SCORE operators that would not run efficiently in a reconfigurable implementation. Consequently, the processor must be able to control and communicate with the array efficiently. A single-chip SCORE system (e.g. see Figure 11) integrating a processor, reconfigurable fabric, and memory blocks could provide tight, efficient coupling of components.

Although a single-chip SCORE implementation offers benefits for performance and design efficiency, the SCORE model permits a wide range of implementations, including one using conventional, commercial components.
5In a degenerate case, there can be only one page, but this sacrifices many of the strengths of the SCORE model.
Figure 11: Hypothetical, single-chip SCORE system (graphic omitted; it integrates a microprocessor with L1 instruction and data caches and an L2 cache alongside compute pages (CPs) and configurable memory blocks (CMBs) connected by a network)
5 Language Instantiations
As a computational model, any number of languages which obey the SCORE semantics could be defined to describe SCORE computations. One could define subsets of conventional HDLs (e.g. Verilog, VHDL) with stylized input/output primitives to describe SCORE operators and operator composition. Similarly, one could define subsets of conventional programming languages (e.g. C++, Java) to perform these tasks. To focus on the necessary semantics, we have defined an intermediate register-transfer-level (RTL) language to describe SCORE operators and their composition for our initial development work. We view our intermediate language, TDF, as a device-independent, assembly-language target on the way to architecture-specific executable operators.
5.1 SCORE Language Requirements
As indicated by the semantics of the SCORE compute model, SCORE operators are synchronous, single-clock entities with their own state. Operators communicate only through designated I/O streams. Operation is gated by data presence on the I/O streams. As such, each operator can be viewed as a finite-state machine with an associated data path (i.e. FSMD [12]). In a multithreaded language, such as Java or C++ with an appropriate thread package, a SCORE operator would be an independent thread which communicates with the rest of the program only through single-reader, single-writer I/O streams. Specifically, SCORE does not have a global, shared-memory abstraction among operators. An operator may own a chunk of the address space (a memory segment) during operation and return it after it has completed, but no two operators may simultaneously own a piece of memory.

fir4(param signed[8] w0, param signed[8] w1,
     param signed[8] w2, param signed[8] w3,
     // param's bound at instantiation time
     input unsigned[8] x,
     output unsigned[20] y)
{
  state only(x):  // ``fire'' when x present
  {
    y = w0*x + w1*x@1 + w2*x@2 + w3*x@3;
    // x@n notation picks out nth previous
    //  value for x on input stream.
    // (this notation is patterned after Silage)
    goto only;    // loop in this state
  }
}

Figure 12: TDF Specification of 4-TAP FIR (a static rate operator)
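For readers more comfortable with C++, the Figure 12 operator can be rendered as an ordinary class; `Fir4` and its explicit shift-register history are our own naming, used only to make the x@n notation concrete (x@n is the value that arrived n firings earlier on stream x).

```cpp
// Software rendering of the fir4 TDF operator (illustrative, not generated
// code). The constructor arguments play the role of the param bindings.
class Fir4 {
public:
    Fir4(int w0, int w1, int w2, int w3) : w_{w0, w1, w2, w3} {}

    // One firing: consume one x token, produce one y token
    // (a static-rate operator: exactly one output per input).
    int fire(int x) {
        hist_[3] = hist_[2];            // shift the history so that
        hist_[2] = hist_[1];            //  hist_[n] corresponds to x@n
        hist_[1] = hist_[0];
        hist_[0] = x;
        return w_[0]*hist_[0] + w_[1]*hist_[1]
             + w_[2]*hist_[2] + w_[3]*hist_[3];
    }

private:
    int w_[4];
    int hist_[4] = {0, 0, 0, 0};        // x@0..x@3, initially zero
};
```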
5.2 TDF
TDF is basically an RTL description with special syntax for handling input and output data streams of an operator. Common data path operators can be described using a C-like syntax. For example, Figure 12 shows how an FIR computation might be implemented in TDF. Operators may have parameters whose values are bound at operator instantiation time; parameters are identified with the keyword param. In the FIR example, the coefficient weights are parameters; these are specified when the operator is created, and the values persist as long as the operator is used. The FIR defines a single input stream (x) and produces a single output stream (y). The behavior of the state is gated on the arrival of the next x input value, producing a new y output for each such input.

To allow dynamic rate operators, the basic form of a behavioral TDF operator is that of a finite-state machine. Each state specifies the inputs which must be present before it can fire. Once the inputs arrive, the operator consumes the inputs, and the FSM may choose to change states based on the data consumed from the inputs. A simple merge operator is shown in Figure 13, demonstrating how the state machine can also be used to allow data-dependent consumption of input values. Output value production can be conditioned as shown in Figure 14. Together, these allow the user to specify arbitrary, deterministic, dynamic-rate operators.
Of course, the FSM gives the user the semantic power to describe heavily sequential and complex, control-oriented
N.B. This version has been simplified for illustration; it does not properly handle the end-of-stream condition.

signed[w] merge(param unsigned[6] w,
                // can use parameters to define data width
                input signed[w] a,
                input signed[w] b)
{
  signed[w] tmpA;
  signed[w] tmpB;
  // states used here to show dynamic data consumption
  state start(a,b):
  {
    tmpA=a; tmpB=b;
    if (tmpA
// uniq behaves like the unix command of the
// same name; it filters an input stream,
// removing any adjacent, duplicate entries
// before passing them on to the output stream.
signed[w] uniq(param unsigned[6] w,
               input signed[w] x)
{
  signed[w] lastx;
  state start(x):
  { lastx=x; uniq=x; goto loop; }
  state loop(x):
  {
    if (x != lastx)
    { lastx=x; uniq=x; }
    goto loop;
  }
}

Figure 14: TDF Specification of uniq Operator (a dynamic output rate operator)
merge3uniq(param unsigned[6] n,
           input signed[n] a,
           input signed[n] b,
           input signed[n] c,
           output signed[n] o)
{
  signed[n] t;
  t = merge(n, merge(n, a, b), c);
  o = uniq(n, t);
}

Figure 15: TDF Compositional Operator
operators. Nonetheless, the programmer should avoid sequentialization and complex control when possible, as operators with many states are less likely to use spatial computing resources efficiently.

Larger operators can be composed from smaller operators in a straightforward manner, as shown in Figure 15.
5.3 C++ Integration and Composition
With a suitable stream implementation and interface code, SCORE operators can be instantiated by and used with a conventional, multithreaded programming language. Figure 16 shows an example C++ program which uses the merge and uniq operators defined here. Note that SCORE operator instantiation and composition can be performed from the C++ code. Once created, the SCORE operators behave as independently running threads, operating in parallel with the main C++ execution thread. In general, a SCORE operator will run until its input streams are closed or its output streams are freed.

Once primitive behavioral (or leaf) operators are defined (e.g. in TDF or some other suitable form) and compiled into their page-level implementation, large programs can be composed entirely in a programming language as shown here. If one thinks of TDF as a portable assembly language for critical computational building blocks, then this language binding allows a high-level language to compose these building blocks in much the same way that assembly-language kernels have been composed using high-level languages to efficiently program early DSPs and supercomputers. The instantiation parameters for TDF operators allow the definition of generic operators which can be highly customized to the needs of the application.
6 Execution Example
The following example demonstrates execution of the design in Figure 16. It shows array compute page reconfiguration, execution of scheduled behavioral code, and some fundamental control signals.

To ground this explanation in a particular hardware configuration and its constraints, we make the following assumptions about the reconfigurable array parameters and the TDF design in the user application:

• The design consists of three behavioral operators. Full implementation of each operator requires only one compute page.

• The reconfigurable array contains one compute page (CP) and three configurable memory blocks (CMBs).
#include "Score.h"
#include "merge.h"
#include "uniq.h"

int main()
{
  char data0[] = { 3, 5, 7, 7, 9 };
  char data1[] = { 2, 2, 6, 8, 10 };
  char data2[] = { 4, 7, 7, 10, 11 };

  // declare streams
  SIGNED_SCORE_STREAM i0,i1,i2,t1,t2,o;

  // create 8-bit wide input streams
  i0=NEW_SIGNED_SCORE_STREAM(8);
  i1=NEW_SIGNED_SCORE_STREAM(8);
  i2=NEW_SIGNED_SCORE_STREAM(8);

  // instantiate operators
  // note: instantiation passes parameters
  //  and streams to the SCORE operators
  t1=merge(8,i0,i1);
  t2=merge(8,t1,i2);
  o=uniq(8,t2);
  // alternately, we could use:
  //  new merge3uniq(8,i0,i1,i2,o);

  // write data into streams
  // (for demonstration purposes;
  //  real streams would be much longer
  //  and probably not come from main)
  for (int i = 0; i < 5; i++) {
    STREAM_WRITE(i0, data0[i]);
    STREAM_WRITE(i1, data1[i]);
    STREAM_WRITE(i2, data2[i]);
  }
  STREAM_CLOSE(i0); // close input
  STREAM_CLOSE(i1); //  streams
  STREAM_CLOSE(i2);

  // output results
  // (for demonstration purposes only)
  for (int cnt=0; !STREAM_EOS(o); cnt++) {
    cout
Table 1: Step-by-Step Execution Example (the per-step physical array view graphics are omitted; the step descriptions follow)
Time I. Initially, assume that the contents of streams i0, i1, and i2 have been loaded by the main processor into segments CMB0 S0, CMB0 S1, and CMB1 S0. In addition, the configuration and initial state of pages (operators) A, B, and C have been loaded into segments CMB0 S2, CMB1 S2, and CMB2 S2, respectively.

Reconfiguration. Page A (merge) is scheduled to run for the first timeslice. First, CP0 is configured with the contents of CMB0 S2. Then, the streams are set up between the CMBs and CP0.
Time II. Array Status. CP0 is running the behavioral code of operator A (merge). The CMB controller has set up the appropriate active segment and operation mode for each CMB: CMB0 and CMB1 act as SeqSrc (sequential source) and CMB2 as SeqSink (sequential sink) relative to the connected streams. At this time, approximately half of the tokens have been consumed from the sources CMB0 S0 and CMB1 S0 and sunk into CMB2 S1.
Time III. End of the first timeslice. Array Status. All tokens from both sources CMB0 S0 and CMB1 S0 have been consumed by CP0 and sunk into CMB2 S1. If the source node of a stream is not producing any tokens (e.g. the now-empty segment CMB0 S0), the sink node can stall due to the unavailability of input tokens (e.g. CP0 is stalled, since operator A requires tokens on at least one input to fire). Such streams are marked EMPTY; the scheduler uses the information about EMPTY streams to optimize the schedule for the next timeslice.
Reconfiguration. CP0 reconfiguration consists of two logically sequential steps, which could be parallelized if the array implementation permits:

1. Save the current configuration and state of CP0 in CMB0 S2, which is allocated for A.

2. Load the configuration and state for page B into CP0 from CMB1 S2.

After CP0 has been configured, the streams are created between the compute nodes. CP0 is ready to run.
Time IV. Array Status. CP0 is running the behavioral code of operator B (merge). Approximately half of the tokens in CMB0 S1 and CMB2 S1 have been consumed by CP0 and sunk into CMB1 S1.
Time V. End of the second timeslice. Array Status. All tokens from both sources CMB0 S1 and CMB2 S1 have been consumed by CP0 and sunk into CMB1 S1. If the sink node of a stream is not consuming tokens (e.g. the 100% full CMB1 S1), the source node can stall on a stream write. Such streams are marked FULL; the scheduler uses the information about FULL streams to optimize the schedule for the next timeslice.
Reconfiguration. The two main steps of reconfiguration are similar to those at time III. CP0 is loaded with the configuration and state of page C (uniq).
Time VI. Array Status. CP0 is running the behavioral code of operator C (uniq). Approximately half of the tokens in CMB1 S1 have been consumed by CP0 and sunk into CMB2 S0.
Figure 17: Timeline for Execution Example (graphic omitted; timeslices I–VII alternate reconfiguration with page execution of A, B, then C)
Time VII. End of the third timeslice. Array Status. All tokens from CMB1 S1 have been consumed by CP0 and sunk into CMB2 S0.
Reconfiguration. The current configuration and state of CP0 are saved in CMB2 S2.

Note: For the application demonstrated here, saving the configuration and state of CP0 was not necessary. A, B, and C were each scheduled only once, so after each runs on CP0 for a timeslice, its state is no longer needed. Saving of configuration and state is shown for completeness only. Should any of the pages be scheduled to run in several non-consecutive timeslices, their state must be saved every time they are preempted and restored when they are scheduled. This is analogous to context switching in traditional operating systems.
to produce master files (merge.cc and uniq.cc) and also parameterized instance files (merge8.cc and uniq8.cc). The next step is to compile all C++ sources, including the driver code in main.C and the tdfc-generated sources. The build process terminates when all driver code is linked with the master files to produce the user application executable (a.out), and the instance objects are linked with the run-time system libraries to produce dynamically linked shared object libraries (merge8.so and uniq8.so) containing the instance code. The purpose of this process should become clear in the next section as we describe the SCORE run-time environment in more detail.
7.2 Run-time Environment
The run-time system consists of the scheduler and the simulator processes that execute under Linux, as shown in Figure 19. In a real system, the OS kernel will contain the scheduler, and a reconfigurable hardware array will replace the simulator. These components are connected by a pair of streams that permit bidirectional communication, transmitting scheduler commands and resource state to and from the array. The scheduler consists of instantiation and scheduling engines.
Instantiation engine. Being an independent process, the scheduler has no knowledge of user applications' compute graphs. The run-time system, together with the shared object files built with a user application, provides a way to communicate the structure of compute graphs from a user application to the scheduler:

1. Upon invocation, a user application places a series of requests to the scheduler to instantiate its compute graph nodes. This is accomplished by the code in the master files, produced by tdfc and linked with the user executable. The code contains a sequence of operations to connect to the scheduler through an IPC channel and request to instantiate an operator. For example, in Figure 19 the code in the invoked merge() routine requests instantiation of the merge operator with inputs t1 and i2.

2. With the request, the scheduler receives a pointer to the shared object file which contains the behavioral code and the attributes of a parameterized instance of an operator. The run-time system dynamically links with that shared object file (here, merge8.so), and the scheduler instantiates an operator and places it on a waiting list to be scheduled. Note that the shared object is necessary here in order to get the user's application code loaded into the address space of the scheduler, which, of course, was built without any knowledge of the user code it might be asked to run.
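The dynamic-linking step in item 2 can be sketched with the POSIX dlopen/dlsym interface. The library and symbol used here (libm and cos) are stand-ins, chosen only because they are ubiquitous; a real run-time would open an operator object such as merge8.so and look up its behavioral entry point.

```cpp
#include <dlfcn.h>   // POSIX dynamic linking: dlopen, dlsym

// Load a shared object into this process's address space and look up a
// named symbol within it. Returns nullptr if either step fails.
void* load_operator_symbol(const char* so_path, const char* symbol) {
    void* handle = dlopen(so_path, RTLD_NOW);   // resolve all symbols now
    if (!handle) return nullptr;
    return dlsym(handle, symbol);               // pointer into the loaded object
}
```

This is exactly why the scheduler can run user code it was built without: the code arrives at run time in the scheduler's own address space.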
The array simulator executes the behavioral code for each resident compute node.
Scheduling engine. The scheduling engine is invoked every timeslice and is responsible for resource allocation and utilization, placement, and routing on the array. It acts as a resource manager capable of enforcing a variety of policies, from fair sharing of the compute resources between multiple user applications to favoring a particular application to meet its real-time constraints.
Array simulator. The simulator provides a cycle-accurate simulation by executing compute node behavioral code, found in the corresponding dynamically linked shared object files (e.g. merge8.so). As noted earlier, it communicates with the scheduler through a pair of streams. Implemented using shared memory, streams also provide direct communication between a user application and the array simulator. In the example in Figure 16 these streams are i0, i1, i2, and o.
8 Example: JPEG
As described in the previous section, we have implemented a complete SCORE run-time system and simulator on top of Linux and are beginning to develop several applications to guide our further understanding of critical design issues for these systems. As an early exercise and demonstration vehicle, we have implemented a complete JPEG (Joint Photographic Experts Group) image compression algorithm [33] in TDF and C++ and performed basic scaling experiments in which we vary the number of computational pages in the system.
8.1 Application
The JPEG compressor mathematically decomposes the input data into high- and low-frequency components. The image is first segmented into 8×8 pixel blocks, and then the decomposition is performed on every individual block via the DCT (Discrete Cosine Transform), a unitary transform that takes the pixel block as input and returns another 8×8 block of coefficients, most of which are close to zero. The coefficients are then scalar quantized and scanned into a one-dimensional stream via a zigzag scan. Quantized coefficients are subsequently compacted with zero-length encoding, after which runs and lengths are Huffman encoded. (See Figure 20.)
Our TDF implementation uses 13 512-LUT pages in order to realize a fully spatial JPEG compressor which is capable of processing one image sample per cycle.
Figure 19: SCORE Run-Time Structure and Interaction with the User Application (graphic omitted; the user application a.out, the scheduler, and the array simulator run as Linux user processes; the application requests operator instantiation over an IPC channel, the scheduler links merge_8.so and uniq_8.so at run time, maintains waitList and residentList, and issues commands such as start/stop CP, start/stop CMB, and CMB-to-CP transfers while receiving array status, and shared-memory streams connect the application to the simulated array of CP0 and CMB0–CMB2)
Figure 20: JPEG Data Flow including Page and Segment Decomposition (graphic omitted; the pipeline comprises 2D-DCT (4 pages), QuantZLE (3), MixToHuff (1), BitstreamAssembly (2), a muxer (1), and a segment writer (2), together with read-only and read/write memory segments; large numbers on computational blocks indicate the number of 512-LUT pages required)
Figure 18: Build Process for User Application from Fig. 16 (graphic omitted; tdfc compiles merge.tdf and uniq.tdf into master files merge.cc and uniq.cc and instance files merge_8.cc and uniq_8.cc; c++ compiles these along with main.C; ld links the master objects with main.o into a.out, while ld -shared builds merge_8.so and uniq_8.so from the instance objects). Note: Each operator is described in a separate TDF source file. Tools used are tdfc (the TDF compiler), c++ (a standard C++ compiler), and ld (a standard linker capable of building stand-alone executables and shared dynamically linked libraries).
Table 2: System Parameters for Experiment.

Simulator Parameter                    Value Assumed
Reconfiguration Time                   5,000 cycles
Schedule Time Slice                    250,000 cycles
Compute Page (CP) size                 512 LUTs
Configurable Memory Block (CMB) size   2 Mbits
External Memory Bandwidth              2 GB/s
For smaller hardware, the SCORE scheduler automatically manages, at run time, the reconfiguration necessary to share the physical CPs among the 13 virtual CPs.
8.2 System Assumptions
For these experiments, we assume a single-chip system as described in Section 4, with external memory as needed for the application. Table 2 summarizes the parameters we assume for the system, based on our experience with the HSRA [30] and embedded DRAM memory [25]. For these experiments, page decomposition is performed manually. The scheduler is list based and operates in a time-sliced fashion like a conventional operating-system scheduler; the scheduler takes care of all decisions on where to place CPs and CMBs and manages all reconfiguration and data transfer, including the data movement on and off the component as necessitated by the finite, on-chip memory capacity. We assume scheduling time is overlapped with computation and takes 50,000 cycles. We do not, currently, model any limitations on routability among pages. The simulator accounts for all time required to reconfigure pages, store state, and transfer data between memories in the chip.
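The flavor of this time-sliced, list-based scheduling can be illustrated with a toy model (this is a hypothetical sketch, not the actual SCORE scheduler; the page names, workloads, and greedy priority rule are illustrative assumptions):

```python
def schedule(work, physical_cps, timeslice=250_000, reconfig=5_000):
    """Toy time-sliced list scheduler.

    Each slice, load the physical CPs with the virtual pages that have
    the most work remaining (a simple list heuristic), charging a
    reconfiguration cost whenever the resident page set changes.
    `work` maps page name -> remaining compute cycles.
    Returns the total makespan in cycles.
    """
    work = dict(work)          # do not mutate the caller's dict
    resident, makespan = set(), 0
    while any(c > 0 for c in work.values()):
        # List heuristic: prioritize unfinished pages by remaining work.
        ready = sorted((p for p, c in work.items() if c > 0),
                       key=lambda p: -work[p])
        chosen = set(ready[:physical_cps])
        if chosen != resident:             # swap pages in/out of the array
            makespan += reconfig
            resident = chosen
        for p in chosen:                   # run the resident pages one slice
            work[p] = max(0, work[p] - timeslice)
        makespan += timeslice
    return makespan

# Fewer physical pages yields a longer, but gracefully degraded, makespan:
two_cp = schedule({'dct': 500_000, 'quant': 250_000}, physical_cps=2)
one_cp = schedule({'dct': 500_000, 'quant': 250_000}, physical_cps=1)
```

Even this crude model reproduces the qualitative area-time tradeoff discussed below: halving the physical pages increases makespan by less than a factor of two when page loads are unequal.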
8.3 Results
To study the scalability, performance, and efficiency of SCORE, we ran our JPEG implementation on a series of simulated, architecture-compatible SCORE systems with varying numbers of physical compute pages. Figure 21 plots the total run time (makespan) of each system versus the number of physical pages in that system. In this particular experiment, we do not scale memory, so the results shown reflect (1) a fixed memory of 26 CMBs, and (2) unlimited memory. For comparison, we show a native x86-MMX implementation using Intel's reference ijpeg library.
[Figure 21: JPEG CP vs. Makespan. The plot shows total time (makespan, in millions of cycles) versus the number of physical compute pages (0 to 14) for SCORE with 26 CMBs, SCORE with unlimited CMBs, and the Pentium II MMX reference.]

The curves demonstrate that SCORE can automatically run the JPEG application on less hardware with graceful performance degradation. Thus SCORE can automatically realize an area-time performance tradeoff. Further, the curves show that this application can be automatically virtualized onto half the compute pages of the fully-spatial implementation without incurring a substantial performance penalty. This kind of result is common when the load on compute operators varies widely; the lightly-loaded operators can time-share a compute page without increasing overall runtime.
The experiment exhibits some anomalies of our present scheduler. The CP-makespan curves are not strictly monotonic due to heuristics in the list-based page selection approach. Also, the scheduler is not optimized to minimize memory usage while buffering streams. In fact, it is not possible to scale down the number of CMBs together with physical CPs in very small hardware because there would not be enough CMBs to virtualize the streams of presently-loaded pages. Hence, this experiment assumes a fixed memory availability of 26 CMBs (twice the number of CPs in the application). To factor out the effect of unoptimized stream buffering, we also performed the experiment with unlimited memory. The results exhibit a speedup of up to twofold over the limited-memory case, suggesting that there is room for improvement in scheduling and memory management.
This experiment represents a single set of SCORE system parameters. As ongoing work, we are exploring many system parameters to gain insight into the regions of operation where SCORE scheduling is most robust and to determine the parameters that provide the most efficient and balanced system design. Such parameters include compute page size, page I/O bandwidth, memory block size, and reconfiguration times.
9 Summary
Reconfigurable computation, defined simply as computation performed on a collection of FPGA or FPGA-like hardware, has shown remarkable promise on point applications, but has not achieved widespread acceptance and usage. One must make a large commitment to a particular FPGA-based system to develop an application. However, as we can now readily predict, the industry produces newer, larger, and faster hardware at a steady pace. Unfortunately, without a unifying computational model which transcends the particular FPGA implementation on which the application is first developed, one is stuck redoing significant work to port the application to newer hardware. This is particularly onerous when the established alternative technology, the microprocessor, offers users steady performance improvements with little or no time investment to adapt to new hardware.
Overcoming this liability requires a computational model which abstracts computational resources, allowing application performance to scale automatically, adapting to new hardware as it becomes available. The computational model must expose and exploit the strengths of reconfigurable hardware and help users understand how to optimize applications for reconfigurable execution. Further, the computational model must allow applications to deal efficiently with dynamic and unbounded resource requirements and dynamic program characteristics. Finally, the model must support the efficient composition of solutions from abstract building blocks.
In this paper, we have introduced a particular computational model which attempts to address these needs. SCORE uses a paging model to virtualize all hardware resources including computation, storage, and communication. It allows dynamic instantiation of dynamically sized computational operators and supports dynamic-rate applications. A page partitioner and compiler, along with a run-time scheduler, take care of automatically mapping the unbounded and dynamically unfolding computational graph onto the fixed resources of a particular hardware platform. We have outlined the hardware requirements for such a model as well as the kind of programming languages needed to describe and integrate SCORE computations. We have implemented a complete SCORE run-time system and simulator. Initial experiments suggest that we can achieve the desired scalability on sample applications. With this initial success, we are now attempting to broaden the range of applications, automate more of the SCORE tool flow, and systematically explore the design space for SCORE-compatible architectures.
Acknowledgements
This research is part of the Berkeley Reconfigurable Architectures, Software, and Systems (BRASS) effort supported by the Defense Advanced Research Projects Agency under contract number DABT63-C-0048 and by the California MICRO Program.
References
[1] Arthur Abnous and Jan Rabaey. Ultra-Low-Power Domain-Specific Multimedia Processors. In Proceedings of the IEEE VLSI Signal Processing Workshop (VSP'96), October 1996.
[2] Altera Corporation, 2610 Orchard Parkway, San Jose, CA 95134-2020. APEX Device Family, March 1999.
[3] Shuvra S. Bhattacharyya, Praveen K. Murthy, and Edward A. Lee. Software Synthesis from Dataflow Graphs, chapter Synchronous Dataflow. Kluwer Academic Publishers, 1996.
[4] Vincent Michael Bove, Jr. and John A. Watlington. Cheops: A Reconfigurable Data-Flow System for Video Processing. IEEE Transactions on Circuits and Systems for Video Technology, 5(2):140–149, April 1995.
[5] Gordon Brebner. A Virtual Hardware Operating System for the Xilinx XC6200. In Proceedings of the 6th International Workshop on Field-Programmable Logic and Applications (FPL'96), pages 327–336, 1996.
[6] Gordon Brebner. The Swappable Logic Unit: a Paradigm for Virtual Hardware. In Proceedings of the 5th IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'97), pages 77–86, April 1997.
[7] Joseph T. Buck. Scheduling Dynamic Dataflow Graphs with Bounded Memory using the Token Flow Model. PhD thesis, University of California, Berkeley, 1993. ERL Technical Report 93/69.
[8] Stephen P. Crago, Brian Schott, and Robert Parker. SLAAC: a Distributed Architecture for Adaptive Computing. In Proceedings of the 1998 IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'98), pages 286–287, April 1998.
[9] David E. Culler, Seth C. Goldstein, Klaus E. Schauser, and Thorsten von Eicken. TAM — A Compiler Controlled Threaded Abstract Machine. Journal of Parallel and Distributed Computing, June 1993.
[10] Jack B. Dennis. Data Flow Supercomputers. Computer, 13:48–56, November 1980.
[11] Jack B. Dennis and David P. Misunas. A Preliminary Architecture for a Basic Data-Flow Processor. In Proceedings of the 2nd Annual Symposium on Computer Architecture, January 1975.
[12] Daniel Gajski and Loganath Ramachandran. Introduction to High-Level Synthesis. IEEE Design and Test of Computers, 11(4):44–54, 1994.
[13] Maya Gokhale, Janice Stone, Jeff Arnold, and Mirek Kalinowski. Stream-Oriented FPGA Computing in the Streams-C High Level Language. In Proceedings of the 2000 IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'00). IEEE, April 2000.
[14] Seth C. Goldstein, Herman Schmit, Matthew Moe, Mihai Budiu, Srihari Cadambi, R. Reed Taylor, and Ronald Laufer. PipeRench: a Coprocessor for Streaming Multimedia Acceleration. In Proceedings of the 26th International Symposium on Computer Architecture (ISCA'99), pages 28–39, May 1999.
[15] Scott Hauck, Thomas Fry, Matthew Hosler, and Jeffery Kao. The Chimaera Reconfigurable Functional Unit. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 87–96, April 1997.
[16] John R. Hauser and John Wawrzynek. Garp: A MIPS Processor with a Reconfigurable Coprocessor. In Proceedings of the IEEE Symposium on Field-Programmable Gate Arrays for Custom Computing Machines, pages 12–21. IEEE, April 1997.
[17] C. A. R. Hoare. Communicating Sequential Processes. International Series in Computer Science. Prentice-Hall, 1985.
[18] R. A. Iannucci. Toward a Dataflow/Von Neumann Hybrid Architecture. In Proceedings of the 15th International Symposium on Computer Architecture, pages 131–140, May 1988.
[19] Jeffery A. Jacob and Paul Chow. Memory Interfacing and Instruction Specification for Reconfigurable Processors. In Proceedings of the 1999 International Symposium on Field Programmable Gate Arrays (FPGA'99), pages 145–154, February 1999.
[20] Mark Jones, Luke Scharf, Jonathan Scott, Chris Twaddle, Matthew Yaconis, Kuan Yao, and Peter Athanas. Implementing an API for Distributed Adaptive Computing Systems. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'99), pages 222–230, April 1999.
[21] Edward A. Lee. Advanced Topics in Dataflow Computing, chapter Static Scheduling of Data-Flow Programs for DSP. Prentice Hall, 1991.
[22] X. P. Ling and H. Amano. WASMII: a Data Driven Computer on a Virtual Hardware. In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (FCCM'93), pages 33–42, April 1993.
[23] Bruce Newgard. Signal Processing with Xilinx FPGAs, June 1996.
[24] Mark Oskin, Frederic T. Chong, and Timothy Sherwood. Active Pages: a Model of Computation for Intelligent Memory. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA'98), June 1998.
[25] Stylianos Perissakis, Yangsung Joo, Jinhong Ahn, André DeHon, and John Wawrzynek. Embedded DRAM for a Reconfigurable Array. In Proceedings of the 1999 Symposium on VLSI Circuits, June 1999.
[26] Jan Rabaey. Reconfigurable Computing: The Solution to Low Power Programmable DSP. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'97), April 1997.
[27] Rahul Razdan and Michael D. Smith. A High-Performance Microarchitecture with Hardware-Programmable Functional Units. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 172–180. IEEE Computer Society, November 1994.
[28] Paul T. Sasaki. A Fast FPGA (FFPGA) Using Active Interconnect. In Proceedings of the 1998 International Symposium on Field-Programmable Gate Arrays (FPGA'98), page 255, February 1998.
[29] Edward Tau, Ian Eslick, Derrick Chen, Jeremy Brown, and André DeHon. A First Generation DPGA Implementation. In Proceedings of the Third Canadian Workshop on Field-Programmable Devices, pages 138–143, May 1995.
[30] William Tsu, Kip Macy, Atul Joshi, Randy Huang, Norman Walker, Tony Tung, Omid Rowhani, Varghese George, John Wawrzynek, and André DeHon. HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array. In Proceedings of the International Symposium on Field Programmable Gate Arrays, pages 125–134, February 1999.
[31] John Villasenor, Chris Jones, and Brian Schoner. Video Communications using Rapidly Reconfigurable Hardware. IEEE Transactions on Circuits and Systems for Video Technology, 5:565–567, December 1995.
[32] John Villasenor, Brian Schoner, Kang-Ngee Chia, and Charles Zapata. Configurable Computer Solutions for Automatic Target Recognition. In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pages 70–79. IEEE, April 1996.
[33] Gregory K. Wallace. The JPEG Still Picture Compression Standard. Communications of the ACM, 34(4):30–44, April 1991.
[34] John A. Watlington. MagicEight: An Architecture for Media Processing and an Implementation. Thesis proposal, MIT Media Laboratory, January 1999.
[35] John A. Watlington and V. Michael Bove, Jr. A System for Parallel Media Processing. In Proceedings of the Workshop on Parallel Processing in Multimedia, April 1997.
[36] Michael J. Wirthlin and Brad L. Hutchings. DISC: the Dynamic Instruction Set Computer. In Proceedings of the SPIE Reconfigurable Computing Conference: Field Programmable Gate Arrays (FPGAs) for Fast Board Development and Reconfigurable Computing, pages 92–103, October 1995.
[37] Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. Virtex Series FPGAs, 1999.
[38] Peixin Zhong, Margaret Martonosi, Pranav Ashar, and Sharad Malik. Accelerating Boolean Satisfiability with Configurable Hardware. In Proceedings of the 1998 IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'98), pages 186–195, April 1998.
[Figure 22 diagram: a switch node takes Data-In and Switch-Control and routes tokens onto True-Side and False-Side streams; these feed a select node, driven by Select-Control, which produces Data-Out.]
Figure 22: switch-select Example Motivating Unbounded Stream Buffers
A Unbounded Stream Buffers
All levels of the SCORE computational model provide the abstraction of unbounded stream buffers. Picking any finite buffer size for streams would introduce an artifact into the model which is very difficult to reason about. In particular, programs would be prone to deadlock any time the number of tokens which the application needed to queue up on a stream between a pair of operators was data dependent.
The canonical instance of this deadlock hazard is exemplified with a pair of nodes, switch and select. switch takes in two inputs, a boolean control stream and a data stream. It sends its output along one of two output streams according to the value of the control input. select takes in three inputs, a control stream and two input streams. However, it does not read all three tokens on each cycle. Rather, it first reads the control token. Based on the value of the control token, it then reads from one of the two input streams and passes that along to its single output stream.
These two nodes can now be hooked up directly to each other with the two outputs of the switch node connected to the two inputs of the select node as shown in Figure 22. We provide separate control streams for the switch and select nodes. Now, if there is ever a stream prefix of the switch-control stream which contains n more TRUEs than FALSEs (or vice versa) than the select-control stream, the stream between the TRUE side of the switch and select nodes will have to hold n tokens. If the streams were limited to some fixed-size buffer m and m < n, then this subgraph would deadlock. Without loss of generality, consider the case in which the switch node receives n TRUEs followed by one FALSE, while the select node initially receives one FALSE control token followed by n TRUEs. The TRUE-side stream would fill up with m tokens. The switch node would not be able to perform any more operations because it cannot write data onto the TRUE-side stream. The select node, however, must process a token from the FALSE side in order to continue, but there are no tokens on the FALSE side to consume. The select node cannot make forward progress until it is given a token on the FALSE side. The switch node cannot make any progress until the downstream operator (the select) consumes a token from the TRUE side. These two operators are now deadlocked on a cyclic dependence. Note that if m > n (or m unbounded), this deadlock would not occur and processing would be able to proceed.
Since, in general, these control streams can be completely independent, it is not possible to say that they will have any particular property between them. If these control streams were coming from outside of the system, we would certainly not have any control or knowledge of their relationship. Even if they were generated inside the system, the general question of whether or not a given computation produces a particular token value after a finite number of operations is equivalent to the halting problem.
Therefore, in order to provide reasonable semantics to the programmer, we accept the unbounded buffer size abstraction and include support in the execution model to expand finite buffers as necessary to meet this abstraction (up to the limit of the amount of memory we have available in the system).
B SCORE Compute Model
Section 3.1 described the compute model informally. This section defines it more precisely.
B.1 SCORE
A SCORE computation is a graph, G:

G = {V, E}
E = {e1, e2, ...}; each ei is an SFIFO
V = Vf ∪ Vt ∪ Vi ∪ Vo
vi ∈ Vf is an SFSM
vi ∈ Vt is an STM
vi ∈ Vi is a SIN
vi ∈ Vo is a SOUT

Notes:
• G will typically be initialized with at least one node vs ∈ Vt to start the computation.
• G may start with many nodes in V.
B.2 SFIFO
An SFIFO is a two-ended, unbounded FIFO which may be created, closed, and freed.

e = {psrc, psink, Q, EOS, FREE}
EOS = Boolean
FREE = Boolean
psrc is a PORT
psink is a PORT
Q is a QUEUE

Operations (v ∈ V is the vertex from which the operation is invoked):

Op           | Requirement                              | Action
write(e,t)   | e.psrc ∈ outs(v), t ∈ Tdata, ¬Q→eos()    | Q→add(t)
close(e)     | e.psrc ∈ outs(v), ¬Q→eos()               | Q→add(TEOS)
t=present(e) | e.psink ∈ ins(v), FREE=false             | t = ¬Q→empty()
t=read(e)    | e.psink ∈ ins(v), FREE=false, EOS=false  | t = Q→rm(); if (t ≡ TEOS) EOS=true
t=eos(e)     | e.psink ∈ ins(v), FREE=false             | t = EOS
free(e)      | e.psink ∈ ins(v), FREE=false             | FREE=true

Notes:
• When an SFIFO is both closed and freed, it is removed from the SCORE graph: (e.Q→eos() ∧ e.FREE ⇒ E = E − {e}).
B.3 QUEUE
A QUEUE is an unbounded queue.

Q.data = ordered list of T = {q0, q1, q2, ..., qn−1}; = {} when first created
T = Tdata ∪ {TEOS}
Tdata = finite alphabet

Q→empty() ≡ value = (|Q.data| ≡ 0)
Q→add(t)  ≡ Q.data = {q0, q1, q2, ..., qn−1, qn = t}
Q→rm()    ≡ if ¬Q→empty(): value = q0; Q.data = {q1, q2, ..., qn−1}
            if Q→empty(): ERROR
Q→eos()   ≡ value = (qn−1 ≡ TEOS)
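A direct transliteration of these QUEUE and SFIFO definitions into executable form may help make the semantics concrete (an illustrative sketch under the definitions above; the Python class and attribute names are our own, and the requirement checks are rendered as assertions):

```python
TEOS = object()  # distinguished end-of-stream token

class Queue:
    """Unbounded QUEUE per B.3."""
    def __init__(self): self.data = []
    def empty(self): return len(self.data) == 0
    def add(self, t): self.data.append(t)
    def rm(self):
        if self.empty():
            raise RuntimeError("rm() on empty QUEUE")  # the ERROR case
        return self.data.pop(0)
    def eos(self): return bool(self.data) and self.data[-1] is TEOS

class SFIFO:
    """SFIFO per B.2: a Queue plus EOS and FREE flags."""
    def __init__(self):
        self.q, self.EOS, self.FREE = Queue(), False, False
    def write(self, t):
        assert not self.q.eos()        # may not write after close
        self.q.add(t)
    def close(self):
        assert not self.q.eos()
        self.q.add(TEOS)
    def present(self):
        assert not self.FREE
        return not self.q.empty()
    def read(self):
        assert not self.FREE and not self.EOS
        t = self.q.rm()
        if t is TEOS:
            self.EOS = True            # sink has observed end of stream
        return t
    def eos(self):
        assert not self.FREE
        return self.EOS
    def free(self):
        assert not self.FREE
        self.FREE = True
```

For example, a producer may write a token and close the stream; the consumer then sees the token, observes EOS on the next read, and frees the SFIFO.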
B.4 SFSM
An SFSM is an FSM with stream (SFIFO) I/O operations.

vf = {sc, sd, d, Sc, Sd, Sres, Pin, Pout, B}
Sc = {s1, s2, ..., sn}
Sd also finite; Sd = Sres × Slocal
sc ∈ Sc; sd ∈ Sd; d ∈ Sc
B = {b1, b2, ..., bn}
bi = {Ii, Ai, Fci, Fdi}
Ii ⊂ Pin
Ai = {ai,0, ai,1, ..., ai,mi}
ai,j ∈ {fi,j, wi,j, ci,j}
csd = current data state ∈ Sd
wi,j = if (gi,j(csd)) write(pout ∈ Pout, vi,j(csd))
ci,j = if (gi,j(csd)) close(pout ∈ Pout)
fi,j = if (gi,j(csd)) free(p ∈ Pin)
vi,j = F : Sd → Tdata
gi,j = F : Sd → Boolean
Fci = F : Sd → Sc
Fdi = F : Sd → Sd
Id = {}; Ad = {}; Fcd = F : Sd → {d}; Fdd = csd (identity)
d = {Id, Ad, Fcd, Fdd}

Operation:
1. Read: in state si, if all inputs in Ii are present, read the inputs into the present state;[7] otherwise, do nothing (stay in state, perform no actions or transitions).
2. Action: perform all guarded writes, closes, and frees in Ai whose guards are true.
3. Transition: update state according to Fci and Fdi.

Notes:
• The SFIFO operations present and read are only available to the read mechanism; they are not available for arbitrary use within the SFSM.
• d is the done state; an SFSM is done when it enters d.

[7] N.B. Values of the present state, csd, actually change to reflect input values.
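The three-phase firing rule above can be sketched as a single step function (a hypothetical encoding for illustration; the Port class and the behavior-table layout are our own, not part of the SCORE definition):

```python
from collections import deque

class Port:
    """Minimal input port: a queue of tokens with present/read."""
    def __init__(self, tokens=()): self.q = deque(tokens)
    def present(self): return bool(self.q)
    def read(self): return self.q.popleft()

def sfsm_step(state, ports, behavior):
    """One firing of the three-phase rule. behavior[state] =
    (needed_input_ports, [(guard, action), ...], next_state_fn)."""
    needed, actions, next_fn = behavior[state]
    # 1. Read: all-or-nothing; if any needed input is absent, stall
    #    with no actions and no transition.
    if not all(ports[p].present() for p in needed):
        return state
    tokens = {p: ports[p].read() for p in needed}
    # 2. Action: perform every guarded action whose guard holds.
    for guard, action in actions:
        if guard(tokens):
            action(tokens)
    # 3. Transition: compute the next control state.
    return next_fn(tokens)

# Example: a one-state machine that forwards only positive tokens.
out = []
behavior = {'run': (['in'],
                    [(lambda t: t['in'] > 0,
                      lambda t: out.append(t['in']))],
                    lambda t: 'run')}
ports = {'in': Port([3, -1, 2])}
state = 'run'
for _ in range(4):                 # fourth step stalls on an empty port
    state = sfsm_step(state, ports, behavior)
```

The stall in step 1 is what gives SFSMs their dataflow character: with no tokens available, the machine performs no actions and remains in its current state.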
• A well-formed SFSM will close all output streams and free all input streams before making its final transition to d.
• A properly specified SFSM will specify both the EOS and data transitions; on EOS of an input, it should transition only to a state that does not read that input and which cannot reach any state which can read said input.
• In a properly formed SFSM, after performing a close action on an output, the machine will transition to a sta