
COMPUTER VISION AND IMAGE UNDERSTANDING

Vol. 64, No. 3, November, pp. 351–367, 1996, Article No. 0064

Programming a Pipelined Image Processor

THOMAS J. OLSON*

Personal Systems Laboratory, Texas Instruments, Inc., P.O. Box 655303, Dallas, Texas 75265

JOHN R. TAYLOR

FORE Systems, Pittsburgh, Pennsylvania 15237

AND

ROBERT J. LOCKWOOD

Digital Systems Design, Tektronix, Inc., 13975 SW Karl Braun Drive, Beaverton, Oregon 97077

Received September 2, 1994; revised June 22, 1995

Real-time computer vision systems often make use of dedicated image processing hardware to perform the pixel-oriented operations typical of early vision. This type of hardware is notoriously difficult to program, limiting the types of experiments that can be performed and posing a serious obstacle to research progress. This paper describes a pair of programming tools that we have developed to simplify the task of building real-time early vision systems using special-purpose hardware. The system allows users to describe computations in terms of coarse-grained dataflow graphs constructed using an interactive graphical tool. At initialization time it compiles these graphs into efficient executable programs for the underlying hardware. The system has been implemented on a popular commercial pipelined image processor. We describe the computational model that the system supports, the facilities it provides for building real-time vision applications, and the algorithms used to generate effective execution schedules for the target machine. © 1996 Academic Press, Inc.

1. INTRODUCTION

Vision systems for robots and autonomous vehicles must meet severe throughput and latency requirements, and they place extreme demands on current-generation computing hardware. Pipelined image processors have proven to be a cost-effective way to meet these demands, at least for early visual processing, and have become very popular among researchers working on real-time vision problems. Unfortunately, as these machines have become more powerful and sophisticated they have also become more difficult to program. The difficulty of writing software for these machines limits the kinds of experiments that can be performed and has become a serious obstacle to progress.

This paper describes a software system that we have developed to simplify the task of programming the Datacube MV20 [17], a pipelined image processor that is widely used for research in robot vision. The system presents an abstract view of the device's capabilities, allowing the programmer to focus on the computation to be performed rather than the manipulations needed to map the computation onto the hardware. Because it is based on an abstract model of the machine, the system could be supported on other architectures as well.

The core of the programming system is VEIL (Virginia's Extensible Imaging Library), a C++ library that provides a dataflow abstraction for programming the underlying machine. VEIL represents computations as directed graphs whose nodes are standard image processing operators (add, subtract, convolve, etc.) and whose arcs represent communications channels. User programs can construct VEIL graphs procedurally by calling functions that instantiate nodes and link their inputs and outputs as needed. VEIL automatically maps the nodes and arcs of the graph onto the underlying hardware, breaking the graph into subgraphs if the computation is too complex to be performed in one step. User programs can control or interact with running VEIL graphs in various ways, making it easy to synchronize vision tasks with robot control computations. Extension mechanisms allow VEIL to support the development of special-purpose routines or the incorporation of new hardware.

* Correspondence should be addressed to: Thomas J. Olson, Texas Instruments, Inc., P.O. Box 655303, m/s 8374, Dallas, TX 75265. E-mail: [email protected].



Although programming in C++ using VEIL is straightforward, the compile-edit-debug cycle makes exploratory programming tedious. To simplify program development, we have written a graphical user interface called MERLIN that runs on top of VEIL and supports rapid prototyping of early vision algorithms. MERLIN allows users to create, run, and modify VEIL graphs by drawing them on the workstation screen. Graphs constructed using MERLIN can be loaded into VEIL applications and controlled in the same manner as procedurally constructed graphs.

The remainder of this paper is organized as follows: the next section describes the context of our work and presents the design goals of the VEIL/MERLIN system. An overview of the MV20 architecture is given in Section 3. Sections 4 and 5 describe VEIL and MERLIN from the user's point of view. Section 6 describes VEIL's internals, with particular attention to the scheduler. Section 7 presents conclusions and future plans.

2. BACKGROUND

Our primary motivation for writing VEIL has been to support research by ourselves and others in robot vision. We are particularly concerned with problems in active vision [3, 8, 9, 44]. The goal of active vision is to allow autonomous agents (vehicles or robots) to use visual information to accomplish tasks in the real world. Active vision work is characterized by an emphasis on interactions between sensing and behavior. Visual information is gathered incrementally and is used to guide real-time decisions about how sensor parameters (such as viewpoint) should be controlled and how incoming images should be processed.

Active vision applications make extreme demands on the underlying hardware and software systems. Typical computations require fast on-line processing of large amounts of image data. Latencies must be kept small to allow the agent to respond quickly to changes in the environment. Image computations are tightly interleaved with control and planning activities, so the environment used for image processing must also be able to support more general computations and I/O activities. The type of image processing to be performed may change rapidly in response to shifts in the agent's goals and priorities.

The extraordinary processing requirements of robot vision applications exceed the capabilities of current generation computers and make the use of dedicated image processing hardware necessary. The nature of research, on the other hand, demands systems that support rapid prototyping and exploratory programming; but dedicated image processing systems are notoriously difficult to program. This conflict motivated us to consider ways of making programming easier. Our previous experience in building coarse-grained dataflow systems [34] suggested that a dataflow paradigm would provide adequate flexibility for the programmer and would also map well to most dedicated image processors. This line of research led to the development of VEIL and MERLIN.

2.1. Related Work

Software environments for computer vision and image processing have been under development for many years and take a variety of forms. They generally perform one or more of the following functions:

• Encapsulating primitives. Image processing algorithms tend to rely heavily on repeated applications of a small number of well-understood primitive operations: convolution, pixelwise arithmetic, point transforms, and so on. Most image processing environments provide encapsulated implementations of these primitives. These may take the form of subroutine libraries [27], stand-alone executable programs [28, 37], or primitive commands in a textual or graphic shell [50].

• Interactive composition of primitives. Most modern systems allow users to invoke sequences of primitive operations interactively and to observe the results. Often the primitives are made available as stand-alone executables operating on files or pipes, as in HIPS [28], Vista [37], and the Khoros base system [38]. Other systems provide invocation via menus [4, 50]. Some systems allow primitives to be both defined and invoked using an interactive interpreted language such as LISP [29] or tcl [10, 35].

• Analysis and display. All systems provide some facility for examining the results of an image processing computation, e.g., displaying images on a monitor. Some also include sophisticated data analysis and plotting tools.

• Software engineering. Some image processing systems (see Lawton and McConnell [29] for a survey) are expressly designed to support the construction of large, complex applications, and include facilities for version control, image and code database management, and automatic interface generation.


In recent years coarse-grained dataflow [6] has emerged as the paradigm of choice for interactive construction of image processing programs. In this paradigm the computation is expressed as a directed graph, usually constructed using a graphical user interface. The nodes of the graph are the primitive operations provided by the system, and have named input and output ports corresponding to the arguments and outputs of the operations. Conceptually, the source nodes of the graph generate potentially infinite sequences of images. Images flow through the arcs of the graph, triggering execution of a primitive function whenever they arrive at a node. This paradigm was first applied to image processing in the HI-VISUAL language [22] and has also been applied to scientific visualization [48, 42], signal processing [30], simulation [12], and data acquisition [24]. A number of visual dataflow languages have been built to explore issues such as user interface design [46] and implementation efficiency [34]. The paradigm has become well known via the widely distributed Khoros/Cantata system [38] and is now being retrofitted to older image processing and vision systems such as KBVision [4].

The concerns of programming systems for dedicated image processing hardware differ somewhat from those of the more general-purpose systems described above. Processing speed is the raison d'être of exotic architectures, so their programming interfaces tend to favor efficient use of the machine over ease of programming. They often mirror the structure of the hardware in the type of interface they present. The programming environment for the Aspex PIPE [26], for example, allows users to construct programs by manipulating graphical representations of the machine's registers and datapath elements. WitFlow [16] and Zebra/ZED [47] provide a similar level of access to Datacube products. Software tools of this type can make programming these architectures much more pleasant than it would otherwise be. However, they require the programmer to specify in detail how the computation should be mapped onto the hardware. The resulting programs are closely tied to the target architecture and, hence, are nonportable and hard to modify.

There have been some attempts to devise more abstract languages for programming image processing and vision hardware. The text-based languages APPLY [20] (for neighborhood operators) and ADAPT [49] (for split/merge computations) allow restricted classes of image processing operators to be expressed in an Ada-like language with data-parallel extensions. They were originally designed to support vision applications on the Warp [5] systolic array but have been ported to a number of other architectures as well. INSIGHT [40] is a text-based dataflow language for programming reconfigurable networks such as the PROTEUS vision machine [21]. It is one of the few systems designed to handle intermediate and high-level vision computations as well as image processing. The price of this generality is that programmers are required to abandon imperative programming constructs and code in a definitional dataflow language [1].

Other researchers have described coarse-grained dataflow visual programming environments for image processing hardware. The Khoros system [38] can be instructed to distribute processes across a local area network. Stewart and Dyer [43] proposed a coarse-grain dataflow environment for the PIPE [26], and they described heuristic algorithms for mapping dataflow graphs to the hardware. Zeltner et al. [52] describe a similar programming system for the proposed SYDAMA-2 vision engine.

VEIL and MERLIN are outgrowths of our earlier attempt [45] to develop a visual programming language for the MV20. They owe their general appearance and interaction style to earlier dataflow visual languages such as AVS [48] and MAVIS [34] and their underlying computational model to signal processing systems such as Gabriel [30, 32]. VEIL differs from earlier systems primarily in its support for building embedded applications that combine image processing with higher level vision and control computations. In particular, it provides a novel and elegant mechanism for communication between dataflow programs running on an accelerator and conventional programs running on the host. VEIL produces highly efficient code and targets a type of architecture that has not been addressed in previous work. VEIL and MERLIN have been fully implemented and are in daily use at a number of sites.

2.2. Design Goals

The design of VEIL has been driven by our desire to conduct experimental research in active vision using real-time image processing hardware. We encountered many frustrations in trying to do this using existing tools. These experiences led us to establish the following requirements for the design:

• Ease of use. The most fundamental requirement on the system is that it be easy to learn and easy to use. In particular, programmers with expertise in vision and AI should be able to write useful programs within a day of being introduced to the system.

• Rapid prototyping. The ease of use of the system should extend to the development cycle. Instead of developing their algorithms on a workstation and then porting them to the real-time environment, users should be able to develop and debug applications directly on the special-purpose hardware.

• Efficiency. Ease of use must be achieved without too great a sacrifice in terms of speed. Some loss of efficiency is probably inevitable, but it should not be inherent in the design of the system. When application-specific optimizations are necessary, they should be accomplished by extending the base system rather than by abandoning it and recoding in some lower level language.

• Flexibility. Active vision computations take place in a behavioral context and are often tightly interleaved with planning and motor control activities. For this reason, the system may not place undue restrictions on the form that user programs can take. It must allow image processing operations to be embedded in arbitrary C programs and must allow the program to start and stop image processing computations or modify their parameters at any time.

• Extensibility/portability. The system should be extensible at several levels. First, it should support incremental performance enhancement. Programmers should be able to develop custom image processing routines that are optimized for particular applications and to use them freely in combination with the standard primitives provided by the system. Second, the system should support incorporation of new hardware as it becomes available and should be based on a model that is general enough to allow porting to other architectures in the future.


One goal that we did not adopt for VEIL was that of providing direct support for high-level vision computations. The immediate reason for this decision was that the target architecture is optimized for low-level pixel processing and provides no support for more abstract image descriptions. More generally, we felt that the dataflow paradigm encourages a functional, value-oriented programming style that is well suited to image processing, but poorly suited to high level vision. For the latter, an object-oriented style such as that provided by the Image Understanding Environment [27] is more natural. Accordingly, VEIL programs are expected to use dataflow to express their pixel-oriented operations and to perform higher level analysis in conventional, sequential C++ code. What VEIL does provide (and many image processing systems do not) is a natural, flexible way to combine dataflow image processing and higher-level sequential code in a single program.

3. THE HARDWARE

VEIL is designed to be independent of the underlying hardware, but its design has been strongly affected by the capabilities of the particular device chosen as the initial target architecture: the Datacube MV20. In this section we describe the MV20 architecture and programming model.

The Datacube MaxVideo 20 [17] is a real-time image processing system that is widely used for active vision research. Figure 1 shows a block diagram of the device. The MV20 architecture can be broken down into three main sections: a programmable switch, a set of "smart" image memories, and a set of deeply pipelined processing elements.

FIG. 1. Block diagram of the MV20. The principal components are the switch, the memories, and the four processing devices (AS, AG, AP, and AU).

The switch is the heart of the device and is responsible for its considerable flexibility. It allows any 10 of 32 inputs to be connected to any of 32 outputs. Each connection allows eight-bit data to be transmitted at 20 MHz, providing a maximum bandwidth of 640 Mbytes per second.

Each of the six memories has a capacity of 1 Mbyte (expanded to 4 or 16 Mb in recent versions of the device). Each memory also contains a pair of address generators that allow it to function as a two-dimensional vector register. The address generators can transfer rectangular arrays of eight-bit pixels in raster-scan order between the switch and the memory at a rate of 20 MHz. Each memory has one read port, one write port, and one read/write port supporting VME access. All three ports can be active simultaneously.

The processing elements perform computations on raster arrays received from the switch or generated internally. Each device is itself a complex network of multiplexors, switches, ALUs, etc. The basic services are provided by four devices:

• AS device: analog to digital conversion of video signals in a variety of formats.
• AG device: analog video output with overlay.
• AU device: ALU operations, multiplication, weighted sums, and statistics.
• AP device: 8 × 8 or paired 4 × 8 convolution, 8 × 8 and 16 × 16 lookup tables, and histogramming.

All of the devices are designed to operate on pixel arrays presented in raster-scan order at 20 MHz.

3.1. Programming Model

The MV20 does not have a program counter or instruction set and, hence, does not execute programs in the usual sense. Instead, the host computer (typically a UNIX workstation) controls the device's operation by storing values into appropriate registers. In order to perform a computation, the host configures the switches, multiplexors, and processing elements to form a network that performs the desired operations on any data that pass through it. It then commands the device to transmit rectangular arrays of pixels from one or more data sources (the AS device or memories) to the network inputs and to store the data appearing at the network outputs in one or more data sinks (the AG device, memories, or the statistics or histogramming sections of the processing elements). For computations that are too complex to be performed in a single pass, the host can repeat the process using the stored results of the previous computation as inputs to the next.

The standard programming interface to the MV20 is a vendor-supplied software package called ImageFlow [18].


ImageFlow consists of a sophisticated device driver and a library of functions that can be called from user programs written in C. The library conceals the necessary low-level register manipulations from the programmer, allowing him or her to describe the computation by referring to the processing elements and switch ports by name.

A typical ImageFlow program begins by constructing one or more processing networks (called pipes in ImageFlow terminology). It does this by calling ImageFlow functions that declare pixel arrays in memory, establish connections between the appropriate pixel arrays, switch ports and processing elements, and set the processing attributes (e.g., convolution kernels and filter parameters) as desired. ImageFlow automatically configures the MV20's memory address generators so that the timing of the raster arrays reflects the desired geometric relationship between them. For example, if two images are to be added ImageFlow ensures that corresponding pixels arrive at the ALU at the same time.

Once the pipes have been constructed, the application program has a number of options for controlling their execution. It can request that they be fired or executed either once or continuously. Firing is nonblocking, so the application can fire a pipe, perform other work, and then perform a blocking wait for the event associated with the pipe. A series of pipes can be packaged into a pipe altering thread (PAT) and uploaded to the MV20 device driver, which will execute the pipes in sequence as rapidly as possible, either once or repeatedly. This feature allows user programs to perform other tasks while the MV20 works on a complex (multipipe) computation.

3.2. The Programming Problem

The ImageFlow library insulates the programmer from many of the idiosyncrasies of the hardware and provides a powerful and flexible way of controlling MV20 computations. However, programming the MV20 in ImageFlow remains painful even for relatively simple computations. The most obvious problem is simply that the device is extremely complex. For example, for the AU device alone the hardware reference manual identifies 109 distinct functional elements and provides six pages of block diagrams showing their interconnections. The result is that even trivial programs require dozens of function calls.

A second, subtler source of difficulties in programming the MV20 arises from the fact that ImageFlow programs describe computations by referring explicitly to the physical resources (memories and computational elements) required to produce them. Programmers cannot simply state that two image streams are to be added; they must specify which of several ALUs is to be used, where in memory the images are stored, and what path the images should take to reach the ALU. The programmer is responsible for ensuring that each operand resides in a distinct memory bank, that the right lookup tables, convolution kernels, etc. are loaded into the right places, and that no datapath or computational element is used twice in the same pipe. The result of this "explicit reference" problem is to make code reuse extremely difficult. Every ImageFlow program contains dozens of hard-coded assumptions about how data are laid out in memory and how they flow through the device. Any two ImageFlow programs invariably have incompatible memory maps and make conflicting demands on computational resources. The net result is that modifying an existing ImageFlow application (e.g., by merging it with another) is almost as difficult as rewriting it from scratch.

4. VEIL

VEIL is a programming library for real-time image processing on the Datacube MV20. It provides C++ class definitions that allow programmers to describe image processing computations in terms of coarse-grained dataflow graphs. It translates these graphs to appropriate ImageFlow structures during program initialization. The computational model that it supports is a form of homogeneous synchronous data flow (HSDF) [32]. In this paradigm, each node of the dataflow graph produces exactly one token per arc per invocation. VEIL adds the restriction that all data sources are synchronous and fire simultaneously. Under this restriction the flow of tokens through the graph depends only on the topology of the graph. This implies that a valid execution schedule can be computed statically, minimizing run-time overhead.

A pure HSDF paradigm does not allow the structure or execution order of a graph to change once the graph has been created. In order to provide greater flexibility, VEIL allows applications to create multiple graphs and to start, stop, or modify them during execution. This "escape" mechanism provides the flexibility needed for active vision applications. In particular, it allows user code to implement conditional or iterative control structures when the application requires them.

VEIL addresses most of the programming problems identified in the previous section. It handles the complexity problem by abstracting the functionality of the MV20 and presenting it as a set of reusable image processing primitives. It addresses the problem of explicit reference by concealing the physical resources of the MV20 from the programmer, providing instead an unbounded number of virtual resources. The fact that there is only one physical convolver does not prevent the program from building a graph with three convolution nodes in it; the VEIL scheduler simply detects the conflict and schedules the convolutions sequentially, allocating memory buffers as needed to store the intermediate results.


This allows the VEIL programmer to describe the computation in terms of the data streams and their transformations, rather than in terms of what each physical device should be doing at each point in time.

4.1. Programming with VEIL

The fundamental activity that VEIL supports is construction and execution of dataflow graphs. It provides the C++ programmer with two new object classes (data types): graphs and operators. Graphs correspond to single HSDF computations, each consisting of a set of operators (graph nodes) corresponding to image processing primitives. Each operator has some number of input and/or output ports, which are connected by links to form a graph. Operators also have attributes that specify details of their computation. Some attributes are common to all operators. For example, every input port on an operator has an attribute that specifies the size and relative position of the pixel array that the port processes. Most attributes are specific to individual operator types, however. These include such things as the masks for convolution or morphological operations, the contents of lookup tables, and the output scaling for arithmetic operations.

Most programs that use VEIL begin by creating a graph object and some number of operators. Next, the operators are inserted into the graph, their attributes are set and their inputs and outputs are connected to complete the graph. The program may proceed to create other graphs or perform any other initialization it requires. The program then invokes the VEIL scheduler, which verifies topological correctness and builds an execution schedule for each graph. The scheduler arranges for the graphs to share physical resources such as memory buffers, lookup table banks, and convolution kernels whenever possible. Finally, the application executes the graph(s) by invoking VEIL's execution control methods.

Figure 2 presents an example of a simple MV20 computation as expressed in VEIL and in ImageFlow. At upper left is an abstract, graphical representation of the computation as the programmer might think of it during program design. The computation graph consists of a CAMERA (i.e., the MV20 A/D converter configured to capture RS-170 video data) connected to a THRESHOLD operator, which is in turn connected to a MONITOR for display on the VGA monitor. The VEIL declarations and code needed to build and execute the graph procedurally are shown at lower left. At right is a hand-optimized program that performs the same computation using raw ImageFlow. The difference in apparent complexity between the ImageFlow and VEIL programs is obvious. A less obvious but equally important difference is that the ImageFlow program commits to a particular strategy for performing the computation: the incoming image is sent from the A/D converter directly to the lookup table for thresholding and is then buffered in memory bank zero. The VEIL program makes no such commitment and, hence, can be modified or extended without fear of introducing conflicts in resource or switch usage. However, the VEIL program does pay a price for its greater flexibility. Although both programs achieve the maximum possible throughput (30 frames per second), the VEIL program buffers the camera output in memory, introducing an additional latency of half a video frame time.

VEIL provides a number of ways for applications programs to control the execution of data-flow graphs. The Cycle() method used in the above example instructs the ImageFlow device driver to begin the next iteration of the graph immediately upon completion of the previous iteration. (For graphs whose input comes from one or more video cameras, the next iteration of the graph is deferred until the start of the next full video frame.) Cycle() is non-blocking. The Run() method causes the graph to execute once and then halt, blocking the host application until the graph computation completes. A nonblocking version of Run() is also provided, so that the application can work in parallel with the MV20 computation if desired and then synchronize with the graph via a blocking Wait() call. The application can halt a graph at any time to replace it with another graph whose computation is better suited to the current sensing situation.

VEIL applications can modify the computation performed by a graph (as opposed to halting it and starting another graph) by changing the attributes of the graph operators. For example, after calling graph.Cycle() the VEIL program in Fig. 2 could change the threshold of the running graph by calling threshold.setThreshold(value). Similar calls allow modification of convolution kernels, constant multipliers and other computational parameters. Unfortunately, ImageFlow does not allow on-the-fly changes to attributes that affect the MV20's internal pipeline delays. Thus it is not possible, for example, to change the size or position of the pixel array being sent to a given operator without halting the graph and rescheduling it.

Programs that use multiple graphs may need to use data produced by one graph as input to another graph. VEIL supports this through a pair of operators called SOURCE and SINK. These operators correspond to pixel arrays in the MV20's memories; the SOURCE operator injects its pixels into the graph as input, while the SINK operator stores the data it receives in MV20 memory. The scheduler can be instructed to treat a SINK/SOURCE pair as aliases for the same physical memory buffer. The result is that pixels sent to the SINK operator by one graph will be fed into the other graph by the SOURCE operator when that graph is executed.
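As an illustration of the mechanism just described, the fragment below sketches how two graphs might be coupled through an aliased SINK/SOURCE pair. Only the SOURCE and SINK operator names and the Run() call come from the text; the class names, the Insert() call, and the aliasing request are assumptions about what such an interface could look like, not the actual VEIL API.

    // Hypothetical sketch of coupling two VEIL graphs through an aliased
    // SINK/SOURCE pair; only the operator names come from the paper.
    Graph  producer, consumer;
    Sink   sink;        // stores its input pixels in an MV20 memory buffer
    Source source;      // injects pixels from an MV20 memory buffer

    producer.Insert(sink);
    consumer.Insert(source);
    // ... remaining operators and links of both graphs ...

    // Ask the scheduler to treat the pair as aliases for one physical buffer,
    // so pixels written by 'producer' become the input of 'consumer'.
    scheduler.Alias(sink, source);

    producer.Run();     // execute once, blocking
    consumer.Run();     // consumes what the first graph produced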


FIG. 2. Three formulations of an MV20 computation that takes pixel data from the camera, thresholds it, and sends it to the monitor for display. At upper left is a conceptual representation; below it is the equivalent VEIL code. At right is a hand-optimized implementation using ImageFlow.
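The VEIL and ImageFlow listings of Fig. 2 are not reproduced in this transcription. As a rough stand-in, the fragment below sketches what the VEIL version of the computation might look like. The CAMERA, THRESHOLD, and MONITOR operators and the Cycle() and setThreshold() calls are described in the text; the class names, header, and the Insert()/Connect()/Schedule() calls are illustrative assumptions.

    // Hypothetical sketch of the VEIL program in Fig. 2:
    // camera -> threshold -> monitor, run continuously.
    #include "veil.h"              // assumed VEIL header

    int main()
    {
        Graph     graph;           // one HSDF computation
        Camera    camera;          // MV20 A/D converter (AS device), RS-170 input
        Threshold threshold;       // point operation via a lookup table
        Monitor   monitor;         // analog video output (AG device)

        graph.Insert(camera);
        graph.Insert(threshold);
        graph.Insert(monitor);

        threshold.setThreshold(128);   // operator attribute

        // Links between output and input ports define the dataflow graph.
        graph.Connect(camera.Output(), threshold.Input());
        graph.Connect(threshold.Output(), monitor.Input());

        graph.Schedule();          // map the graph onto ImageFlow pipes
        graph.Cycle();             // run continuously; returns immediately

        // ... host-side vision and control code runs here in parallel ...
        return 0;
    }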


VEIL provides several mechanisms for transferring information between image processing graphs and the host application. A number of operators are specifically designed to gather information for the host and do not have output ports. Instead these operators provide methods that allow the host to retrieve the current contents of their buffers. For example, the HISTOGRAM, STATISTICS, and SSD (sum of squared difference) operators accumulate information about the grey levels in one or two images by summing over the image array and allow the host to read a vector containing their results. The IMPORT operator receives an array of pixels from the graph, and allows the host to read any specified subarray into user memory. The EXPORT operator serves as a companion to the IMPORT operator, allowing the host to copy pixels from user memory to an MV20 memory buffer that serves as a source in a graph. This operator is useful, for example, for injecting test images into a graph that is under development. The data transfer methods for these operators may be called at any time. If they are called while the graph is running or cycling, they will block until the end of the execution cycle in order to reduce the risk of read/write conflicts.

Using the VEIL information-gathering operators as described above requires the programmer to keep track of the graph state explicitly. In order to support a more event-driven style of programming, VEIL also allows callback functions to be associated with these operators. When a callback is added to an operator, the VEIL scheduler instructs the ImageFlow device driver to generate an event upon completion of the ImageFlow pipe that terminates in that operator. If the program calls graph.Update() while the graph is cycling, VEIL will wait for the events in the order in which they were scheduled and will call the appropriate callback for each event as it arrives.
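A small sketch of the event-driven style just described, assuming a HISTOGRAM operator feeding a host callback. graph.Cycle() and graph.Update() are the calls named in the text; the Histogram class, AddCallback(), and Read() are assumptions used only for illustration.

    // Hypothetical sketch of event-driven host code; only Cycle() and
    // Update() are calls documented in the text.
    static void onHistogram(Histogram& h)
    {
        long bins[256];
        h.Read(bins);                    // copy the accumulated grey-level histogram
        // ... e.g., pick a new threshold from the distribution ...
    }

    void runEventDriven(Graph& graph, Histogram& hist)
    {
        hist.AddCallback(onHistogram);   // request an event when hist's pipe finishes
        graph.Cycle();                   // run the graph continuously
        while (true) {
            graph.Update();              // wait for this cycle's events, invoke callbacks
            // robot planning and control code interleaves here
        }
    }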

5. MERLIN

MERLIN is a graphical user interface that allows users to build and execute VEIL graphs interactively. It is written in C++ using the standard VEIL library interface and the SUIT user interface toolkit [36]. The functionality it provides is similar to that of Cantata (Khoros) [38] and the other dataflow environments discussed in Section 2. The user creates a graph by selecting VEIL operators from a bank of menus, placing them in the workspace, and connecting their inputs and outputs with the mouse. Once the graph has been built it can be executed by pressing the "Cycle" button, which invokes the VEIL scheduler and cycles the graph.

MERLIN supports graph editing by standard direct manipulation techniques: operators and links can be repositioned, created, or destroyed using the mouse. Selecting an operator with the mouse causes that operator's attributes to be displayed on a scrollable palette of appropriately typed widgets. The user can use these widgets to change the value of an attribute at any time. If possible, MERLIN makes the requested changes immediately. As mentioned in Section 4, some attribute changes require that the graph be halted and rescheduled. Since rescheduling may take several seconds, attribute changes of this type are deferred until the user halts the graph or explicitly requests rescheduling. Visual cues are provided to inform the user that deferred attribute changes are pending.

Both MERLIN and VEIL allow users to save graphs to files and load them in again. This allows graphs to be developed in the interactive MERLIN environment and then loaded into embedded, noninteractive VEIL applications. The application can gain access to the operators by looking them up in a name table. Names consist of ASCII strings and appear on the MERLIN icon representing the operator instance. Thus, instead of constructing a graph procedurally as in Fig. 2, an application might begin by loading it from a file and calling graph.Find() to obtain a handle for the THRESHOLD operator. Experience has shown that most users prefer to build applications this way, because it allows them to change the image processing their program does without recompiling. They typically keep the MERLIN window open all the time, turning off interactive execution to test their applications and turning it back on to modify and save the graph.
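The load-and-lookup workflow described above might look roughly like the following. graph.Find(), setThreshold(), and Cycle() are named in the text; the Load() call, the file name, and the operator name are assumptions.

    // Hypothetical sketch of an embedded application that loads a graph
    // drawn and saved in MERLIN rather than building it procedurally.
    Graph graph;
    graph.Load("edges.veil");                        // assumed file-loading call

    // Look up the named THRESHOLD operator to get a handle on it.
    Threshold* thresh = (Threshold*) graph.Find("thresh");

    graph.Cycle();                                   // start continuous execution
    thresh->setThreshold(90);                        // tune the running graph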

A sample MERLIN screen is shown in Fig. 3. The graph displayed in the work area performs a set of image processing operations used by Horswill and Brooks for an experiment in reactive robot control [23]. The two nodes at the top of the graph digitize the camera input and subsample it by a factor of four in each dimension, producing a 128 × 128 image. The graph then splits into two separate streams. The right-hand stream approximates a temporal derivative by backward differencing. The left-hand stream performs logarithmic scaling using a look-up table, estimates the gradient using a Sobel operator, and computes the sum of the absolute values of the gradient components as an approximation to the gradient magnitude. The result is passed through a threshold, which can be altered interactively using the dial widget at left. Both the time derivative and thresholded gradient magnitude are sent to the host for display in a pair of X windows. The status line at lower left reports that VEIL's scheduler broke the computation into three sequentially executed subgraphs (pipes) and achieved a data rate of 30 frames per second. The data rate is limited by the input video frame rate; for this particular graph the MV20 is idle 45% of the time.


FIG. 3. A typical MERLIN screen. The illustrated graph subsamples the image and computes the temporal derivative and the filtered and thresholded gradient magnitude, as described in [23].

6. VEIL INTERNALS

The internal architecture of VEIL is divided into four logical components. The graph manager consists of the data structures and functions that allow user programs to construct graphs and set their attributes. The operator set is the set of available image processing primitives. The scheduler is responsible for mapping the computation described by a VEIL graph onto the hardware and building an execution schedule. Finally, the executive communicates with the ImageFlow device driver and allows user programs to control the computation as desired.

6.1. The Graph Manager

The role of the graph manager is relatively simple: it defines the basic graph class and provides the methods needed to create or modify computation graphs. The basic operations it supports are creating or destroying a graph, inserting or deleting an operator, and creating or destroying a link between operators. The graph manager also serves as a repository for information about each graph object, such as whether it has been scheduled or not, whether it is running, and what physical resources it uses. Finally, it serves as a launch point for the executive. Since the graph is the smallest executable object in VEIL, execution control commands are naturally expressed as methods of the graph class.
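Read together with Section 4, the description above suggests a graph class roughly along the following lines. The sketch is an assumption about the shape of the interface, not the actual VEIL declaration; only the idea that execution control appears as graph methods, and the Run/Cycle/Wait operations themselves, come from the text.

    // Hypothetical outline of the graph manager's central class.
    class Graph {
    public:
        // Construction and modification of the computation graph.
        void Insert(Operator& op);
        void Remove(Operator& op);
        void Connect(OutputPort& from, InputPort& to);
        void Disconnect(OutputPort& from, InputPort& to);

        // Bookkeeping the graph manager maintains per graph object.
        bool IsScheduled() const;
        bool IsRunning() const;

        // Launch point for the executive: execution control as graph methods.
        void Run();      // execute once (a nonblocking variant also exists)
        void Cycle();    // execute repeatedly, nonblocking
        void Wait();     // block until the current execution completes
        void Halt();
    };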


6.2. The Operator Set

The design of the operator set is critical to the operation of the other components of the system. The various operator types are organized into a class hierarchy, and all are ultimately subclasses of the generic class Operator. Every operator instance provides the following information:

• Port descriptors. For each input or output port, there is a descriptor that identifies any links that are attached to the port and (for input ports) the size and alignment of the incoming pixel array. This information is normally set by the application programmer. The descriptor also specifies the name of the physical switch port on the MV20 at which the port produces or accepts data. This information is derived from the class definition and is accessible only to the scheduler.

• Attribute descriptors. For each attribute there is a descriptor that specifies its name (as an ASCII string), its type, and its current value.

• Resource list. Every operator class contains a list of the physical MV20 resources it uses. These include the switch ports mentioned in the port descriptors, plus any multiplexors or computational elements that it requires in order to perform its function. The scheduler uses this list to determine whether a set of operators can run in parallel.

• Build method. Every operator defines a procedure called its build method. When invoked, the build method constructs a partial ImageFlow pipe that begins and ends at the physical switch ports identified in the port descriptors. In between, the pipe passes through the computational elements listed in the resource list. The build method reads the current values of the attributes and configures the computational elements so that they transform the data passing through them appropriately.

The graph manager, scheduler, and executive manipulate VEIL operators at the superclass level, using the information and methods described above. Dynamic inheritance allows them to do their jobs without any hard-coded knowledge of the individual operator classes. This makes it relatively straightforward for programmers familiar with basic ImageFlow programming to add new operator types to VEIL. They simply declare a new subclass of class Operator and supply appropriate descriptors and methods. The process is not painless, but it is much easier than writing a complete ImageFlow pipe. Once it is done, the new operator automatically inherits schedulability from the Operator class and can be used without modification in any VEIL application.
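To make the extension path concrete, here is a hypothetical skeleton of a user-defined operator. The member and resource names are invented for illustration; only the four ingredients (port descriptors, attribute descriptors, resource list, build method) and the idea of subclassing Operator come from the text.

    // Hypothetical skeleton of a new operator class; names are illustrative.
    class Scale : public Operator {        // multiply each pixel by a constant
    public:
        Scale()
        {
            // Port descriptors: where on the MV20 switch this operator
            // accepts and produces its raster data.
            AddInputPort("in", "AU_A_IN");
            AddOutputPort("out", "AU_A_OUT");

            // Attribute descriptor: name, type, current value.
            AddAttribute("factor", FLOAT_ATTR, 1.0);

            // Resource list: physical elements the build method will use;
            // the scheduler consults this list to detect conflicts.
            AddResource("AU.MULTIPLIER");
        }

        // Build method: construct the partial ImageFlow pipe between the
        // declared switch ports, configured from current attribute values.
        virtual void Build(Pipe& pipe)
        {
            double factor = GetFloatAttribute("factor");
            // ... ImageFlow calls that route data through the AU multiplier
            //     and program the constant would go here ...
        }
    };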

6.3. The Scheduler

The job of the VEIL scheduler is to compile the data structure describing a VEIL graph into a form that the executive and the ImageFlow device driver can use to perform the computation. In this section we first present informal examples of the scheduling problem and then define the problem formally and describe the VEIL scheduling algorithm. Finally, we discuss the scheduler's performance and limitations.

In simple cases the scheduler may be able to embed the input VEIL graph directly in the MV20 hardware. Figure 4 shows an example of such a situation. The graph at left takes an array of pixels supplied by the host (via an EXPORT operator), convolves it with a mask, and returns the absolute value of the result to the host via an IMPORT operator. The CONVOLVE and ABS (absolute value) operators have disjoint resource lists, so they can be run as part of the same computational pipeline, as shown at right. Note that since there is no buffering between the two computational operators, the presence of the second operator has essentially no impact on the running time of the graph; the execution time is given by the number of pixels in the rectangle divided by the pipeline data rate (20 MHz). In order to build an ImageFlow pipe for this computation, the scheduler calls the build methods for each of the operators. The build methods for the CONVOLVE and ABS operators construct partial pipes, as described above. The build methods for the IMPORT and EXPORT operators invoke a resource allocator to reserve space in memory for their pixels and to update the switch port addresses contained in their port descriptors. The scheduler then makes connections through the switch to the ports identified in the port descriptors, completing the pipe.

In more complex cases it may not be possible to schedule the entire VEIL computation as a single ImageFlow pipe. For example, the graph may contain operators whose resource needs are in conflict, or may require more connections than the switch can support. The VEIL scheduler handles these more difficult cases by partitioning the graph into a set of subgraphs, each of which is small enough to be run as a single pipe. Outgoing data flow arcs that cross a partition boundary are connected to temporary buffers in MV20 memory. These buffers then become inputs to the subgraphs whose input arcs cross the partition boundary. The subgraphs are executed sequentially inside an ImageFlow PAT, in such a manner that the order of execution of the operators is a topological sort of the input graph. This guarantees that the introduction of memory buffers does not change the result of the computation.

The graph of Fig. 3 illustrates some of the problems posed by more complex graphs. The LUT1 and THRESHOLD operators cannot execute simultaneously, because they each require exclusive access to the MV20's single-input look-up table. The SHRINK, DELAY, and XWINDOW operators must be handled specially because they contain internal memory buffers. The SHRINK operator, which subsamples the image by a constant factor, is implemented by commanding the vector address generator of a memory write port to omit some of the rows and columns sent to it. Thus a SHRINK operation can only be performed in the course of storing an image into memory.


FIG. 4. (a) A simple VEIL graph. (b) the same graph mapped to a single pipelined computation on the MV20 (cf. Fig. 1).

The DELAY operator uses a memory buffer to store pixels received during the current iteration and releases them into the graph during the next iteration. The XWINDOW operator uses the VEIL callback mechanism to transfer pixels from the MV20 to the host and display them on the workstation screen.

Figure 5 shows how the VEIL scheduler might partition the graph of Fig. 3. The first partition simply copies the CAMERA buffer to the memory buffer associated with the SHRINK operator. No further work can be done, because the SHRINK operator must terminate a pipe and no other node is ready to execute. The second partition subtracts the result of the SHRINK operation from the current contents of the DELAY buffer and sends it to the XWINDOW1 output buffer. At the same time, it sends the current contents of the SHRINK buffer through the first lookup table LUT1, the derivative operator DXDY, and the two-input lookup table LUT2. At this point the scheduler closes the pipe and deposits the LUT2 result in temporary buffer TEMP1. It cannot add the THRESHOLD operator to the current pipe, because its resource needs conflict with those of LUT1, and it cannot copy the SHRINK buffer to the DELAY buffer because the latter is in use. In the third and last partition, the contents of TEMP1 are sent to the XWINDOW2 buffer via the THRESHOLD operator, and the SHRINK buffer is copied to the DELAY buffer. At 20 MHz, the first partition (which works on 512 × 512 arrays) takes about half a video frame time. The remaining partitions work on 128 × 128 arrays and, hence, take only about 1/32 of a frame time. The entire schedule thus runs at 30 Hz, limited by the video frame rate.

FIG. 5. A VEIL schedule for the graph of Fig. 3. Shaded operators are implemented as MV20 memory buffers. The computation has been broken into three subgraphs, each of which executes as a single ImageFlow pipe.


The Scheduling Problem

The scheduling problem can be formalized as follows: We are given a directed acyclic graph G whose vertices are taken from a set O representing the available operators. There is a compatibility relation on O; a given pair of vertices is compatible if the resource lists of the corresponding VEIL operators are disjoint. Each vertex i has associated with it an integer cost c_i that specifies the number of pixels in the largest rectangle flowing through the node. The graph is to be scheduled for a system with M memory banks and a switch that can support S simultaneous connections. For the MV20, M and S have values six and ten, respectively.

A valid schedule for a graph G is a partition of the graph into a sequence of disjoint subgraphs g_1 ... g_k satisfying the following constraints:

(1) all ancestors of vertices in g_j are found in subgraphs g_1 ... g_j. This guarantees that no operator can execute before its input data are available.

(2) all vertices in each subgraph are mutually compatible. This ensures that there are no resource conflicts between vertices in the same subgraph and, hence, that all operators within a partition can execute simultaneously as part of the same ImageFlow pipe.

(3) no subgraph contains more than S arcs, including arcs that cross a partition boundary. This guarantees that there is enough switch bandwidth to route data between the computational elements and memory buffers that implement the subgraph computation.

(4) every arc that crosses a partition boundary is assigned an integer in the range [1 ... M], and no two input or output arcs for a single subgraph are assigned the same integer. This restriction captures the constraint that each MV20 memory bank has one read port and one write port and, hence, can support at most one read and one write per pipe.

Since the subgraphs that make up a schedule are executed sequentially, the execution time of the entire schedule is the sum of the execution times for each subgraph. Pipeline delay and setup time are negligible for images of reasonable size, so the execution time for a subgraph can be approximated by the time required to send the largest pixel array in the subgraph through the switch. Thus an optimal-time schedule for a VEIL graph is one that obeys the above constraints and also minimizes the cost function

    \sum_{i=1}^{k} \max_{j \in g_i} c_j .
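As an added worked illustration of this cost function, consider the three-partition schedule of Fig. 5, assuming that the largest arrays in the three pipes are the 512 × 512 and two 128 × 128 arrays quoted earlier and that pipeline delay and setup time are ignored, as the text says they are negligible:

    \sum_{i=1}^{3} \max_{j \in g_i} c_j
        = 512^2 + 128^2 + 128^2
        = 294{,}912 \text{ pixels}
        \;\Longrightarrow\;
        \frac{294{,}912}{20 \times 10^{6}\ \text{pixels/s}} \approx 14.7\ \text{ms},

well under the roughly 33 ms RS-170 frame time, so that schedule is limited by the video rate, as reported in Section 5.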

The problem of finding optimal-time schedules for VEIL graphs is NP-complete, like many other multiprocessor scheduling problems [39]. Simply finding a partition into k compatible subgraphs is equivalent to k-coloring the complement of the graph defined by the compatibility relation. Assigning integers to arcs that cross partition boundaries also involves a graph-coloring problem. There is a substantial body of literature on approximation algorithms for problems of this type [25, 31, 39, 41, 51], and our original intent was to adapt one of these algorithms to the VEIL scheduling problem. However, experiments (see below) have shown that a naive heuristic algorithm produces reasonably good schedules for the sort of problems that arise in our work. Thus we have not yet had to resort to these more sophisticated strategies.

The Scheduling Algorithm

The current VEIL scheduler constructs an execution schedule by traversing the graph in topologically sorted order, starting at its sources. At each stage it attempts to pack as many of the remaining nodes as possible into the current pipe, subject to resource and topology constraints. When no more nodes can be added to the current pipe, it connects the outputs of the pipe to temporary memory buffers and creates a new empty pipe. When all nodes have been scheduled, the temporary buffers are bound to physical memory locations using a register coloring algorithm [2] and the partitions are used to generate a PAT. The algorithm can be expressed concisely as follows:

    1  mark all sources 'ready'
    2  REPEAT
    3      Create a new, empty current partition
    4      WHILE there exists a 'ready' operator that is compatible
               with the operators in the current partition DO
    5          Add it to the current partition
    6          Check operators that use its outputs and mark 'ready'
               if appropriate
    7      FOR all output arcs from the current partition DO
    8          Connect the arcs to temporary memory buffers.
    9  UNTIL all operators have been scheduled

The algorithm is simple and fast, but it ignores a great deal of potentially useful information. In particular, it adds operators to the current partition in the order in which it encounters them. In some cases better schedules would be produced by an algorithm that considered alternate orders, or that used heuristic criteria to decide which operator to add next.
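For concreteness, the following C++ fragment sketches the greedy loop above. It is an illustration of the published pseudocode, not the VEIL implementation: the Node structure is an assumption, and the switch-connection limit S, the memory-port constraint, and the final register-coloring pass over temporary buffers are omitted.

    // Illustrative sketch of the greedy partitioning loop; not the VEIL code.
    #include <set>
    #include <string>
    #include <vector>

    struct Node {
        std::set<std::string> resources;  // physical MV20 elements this operator uses
        std::vector<int>      preds;      // indices of predecessor nodes
        std::vector<int>      succs;      // indices of successor nodes
    };

    static bool compatible(const Node& n, const std::set<std::string>& inUse)
    {
        for (std::set<std::string>::const_iterator r = n.resources.begin();
             r != n.resources.end(); ++r)
            if (inUse.count(*r)) return false;         // resource conflict
        return true;
    }

    // Returns the partitions; each is a list of node indices run as one pipe.
    std::vector<std::vector<int> > schedule(const std::vector<Node>& g)
    {
        std::vector<int>  waitingPreds(g.size());      // unscheduled predecessors
        std::vector<bool> done(g.size(), false);
        for (size_t i = 0; i < g.size(); ++i)
            waitingPreds[i] = (int) g[i].preds.size();

        std::vector<std::vector<int> > partitions;
        size_t remaining = g.size();
        while (remaining > 0) {                        // REPEAT ... UNTIL all scheduled
            std::vector<int> part;                     // new, empty current partition
            std::set<std::string> inUse;
            bool added = true;
            while (added) {                            // WHILE a ready, compatible node exists
                added = false;
                for (size_t i = 0; i < g.size(); ++i) {
                    if (done[i] || waitingPreds[i] > 0 || !compatible(g[i], inUse))
                        continue;
                    part.push_back((int) i);           // add it to the current partition
                    inUse.insert(g[i].resources.begin(), g[i].resources.end());
                    done[i] = true;
                    --remaining;
                    added = true;
                    for (size_t s = 0; s < g[i].succs.size(); ++s)
                        --waitingPreds[g[i].succs[s]]; // successors may now be ready
                }
            }
            // Arcs leaving this partition would be connected to temporary
            // MV20 memory buffers here (omitted).
            partitions.push_back(part);
        }
        return partitions;
    }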

Scheduler Performance

We evaluated the VEIL scheduler by comparing its output to that of an experimental branch-and-bound scheduler that produces provably optimal schedules. We collected a set of 13 VEIL graphs developed in support of various projects around our lab, ranging in size from 9 to 39 operators. Space does not permit a detailed description of all of the graphs, but Fig. 6 shows a typical example, a graph that uses phase differencing to estimate the horizontal component of the motion field [14].


the graphs, but Fig. 6 shows a typical example, a graph that uses phase differencing to estimate the horizontal component of the motion field [14].

FIG. 6. Phase-based horizontal motion estimator from the VEIL test set. The convolvers extract sine-phase and cosine-phase components of the image at every point. The arctangent of the ratio of appropriate sums of products of the components for the current and previous images is proportional to the horizontal rate of change of phase (see Burt et al. [14]).
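For concreteness, one standard way to realize the "appropriate sums of products" mentioned in the caption is the quadrature-pair identity below; the particular combination shown is an assumption about the structure of the VEIL graph, not a detail taken from the figure. If c_t \approx A\cos\phi_t and s_t \approx A\sin\phi_t are the cosine-phase and sine-phase convolver outputs at a pixel of the current image, and c_{t-1}, s_{t-1} are the corresponding outputs for the previous image, then

    \Delta\phi = \arctan\!\left( \frac{s_t\,c_{t-1} - c_t\,s_{t-1}}{c_t\,c_{t-1} + s_t\,s_{t-1}} \right)

is the change of phase between the two frames (the amplitude factors cancel in the ratio), and dividing it by the spatial frequency of the filter yields a horizontal displacement estimate.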

The VEIL scheduler adds nodes to the current pipe in the order in which it encounters them. Thus its output depends on the order in which nodes are created during graph construction. In order to take this into account, we tested the scheduler on 1000 instances of each test graph, randomly permuting the order of node creation each time. For each instance we recorded the estimated running time of the resulting schedule. Although the number of possible node orderings is large, the number of topologically distinct schedules that the scheduler can find is invariably small, and there are usually several with identical running times. Thus the sequence of running times observed in 1000 trials contains only a small number of distinct values. Dividing the number of occurrences of each value by one thousand gives the relative frequency of that value, which is an estimate of the probability of observing that value on a random trial.
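The experiment is straightforward to reproduce in outline. The sketch below reuses the hypothetical Op, Partition, greedySchedule, and scheduleCost definitions from the earlier sketch (it is an illustration of the procedure, not the code behind the measurements reported here): on each trial it shuffles the operator creation order, reruns the scheduler, and tallies the resulting running-time estimates into relative frequencies.

    // Estimate the distribution of schedule running times over random
    // operator-creation orders (assumes the greedySchedule sketch above).
    #include <algorithm>
    #include <cstdio>
    #include <map>
    #include <random>

    void sampleScheduleDistribution(const std::vector<Op>& graph,
                                    const std::function<bool(const Partition&, int)>& compatible,
                                    int trials = 1000)
    {
        std::mt19937 rng(0);                     // fixed seed for repeatability
        std::map<double, int> histogram;         // estimated running time -> occurrences

        std::vector<int> order(graph.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = (int)i;

        for (int t = 0; t < trials; ++t) {
            std::shuffle(order.begin(), order.end(), rng);    // random creation order

            // Renumber the operators in the shuffled order so the scheduler
            // encounters them differently on each trial.  Note that 'compatible'
            // must judge operators by their properties, not by their indices.
            std::vector<int> newIndex(graph.size());
            for (std::size_t i = 0; i < order.size(); ++i) newIndex[order[i]] = (int)i;
            std::vector<Op> permuted(graph.size());
            for (std::size_t v = 0; v < graph.size(); ++v) {
                Op op = graph[v];
                for (int& s : op.succs) s = newIndex[s];
                permuted[newIndex[v]] = op;
            }

            double time = scheduleCost(greedySchedule(permuted, compatible), permuted);
            ++histogram[time];
        }

        for (const auto& [time, count] : histogram)           // relative frequencies
            std::printf("%8.2f ms  frequency %.3f\n", time, count / double(trials));
    }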

The entries above the final entry in Table 1 summarize the results of the experiment for the thirteen test graphs. For the heuristic scheduler the table gives the sample distribution over 1000 trials, expressed as the relative frequency of each running time observed during the trials. The table also reports the sample mean running time for the heuristic schedules, the running time for the optimal schedule, and the percentage by which the sample mean running time exceeds the optimal running time. For nine of the test graphs the heuristic schedules had degenerate sample distributions; the running times for all 1000 trials were identical and were equal to the running time of the optimal schedule. For the remaining graphs, the heuristic running times were always within 26% of optimal and were within 10% on average. Execution times for the heuristic scheduler were less than one second in all cases, compared with minutes or hours for the branch-and-bound scheduler.

It is somewhat surprising that the current scheduler works as well as it does on the test graphs. If there is no constraint on the compatibility relation, it is theoretically possible to construct graphs for which the running time of the heuristic schedule exceeds that of the optimal schedule by an arbitrary amount. Figure 7 shows the construction for one such graph. In practice the finite bandwidth of the MV20 puts an upper bound on the degree of parallelism that an optimal schedule can exhibit. The worst graph we have been able to produce by following the construction shown in Fig. 7 is one for which the heuristic scheduler's worst-case running time exceeds that of the optimal schedule by 150%. The last entry in Table 1 summarizes the results of 1000 trials on this graph. Fortunately the conditions that produce this behavior seem to arise rarely in practice.

We speculate that the relatively good performance of the current VEIL scheduler is due to the fact that the topological and resource constraints of the MV20 place tight limits on the space of valid schedules. The graph induced by the compatibility relation is fairly sparse, limiting the opportunity for parallel execution. The fact that many nodes (e.g., SHRINK) force termination of the current pipe also limits parallelism. An architecture similar to the MV20 but offering more switch bandwidth and more potential parallelism would probably increase the benefit to be obtained by using a more intelligent scheduler. Improved scheduling heuristics are being investigated as part of our current work on VEIL.

Scheduler Limitations

The current VEIL scheduler finds reasonably good schedules for the graphs in the test set. However, it is constrained to solve the problem as formalized above, that is, to break the input graph into a series of partitions to be executed in sequence. Modifying the definition of a valid schedule would result in better performance in some cases. One approach would be to relax the requirement that the graph partitions be executed in strict sequence. For example, it might be possible to execute a series of pipes that operate on small images in parallel with a single pipe that operates on larger images. Another option would be to optimize the schedule for repeated execution via software pipelining [7], as is done in (e.g.) the Gabriel scheduler [31]. In a software pipelined schedule, multiple iterations of a loop are executed in parallel. In the example of Fig. 5, for instance, the nth invocation of the first partition could be executed in parallel with the (n − 1)th invocation of the third partition. The computation performed by the graph would be unchanged, as would its latency, but its iteration period would be reduced. Note, however, that software pipelining requires prologue and epilogue code to handle the initial and final iterations, and that in the general case it can increase latency.
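As an illustration only (this is not VEIL code, and the partition labels are hypothetical), the following sketch prints the issue order for a wrapped, software-pipelined version of a three-partition schedule P0, P1, P2, assuming that the last partition of one iteration is compatible with the first partition of the next and can therefore share a step with it.

    // Print the prologue / steady-state / epilogue structure of a wrapped
    // three-partition schedule.  P2[n-1]+P0[n] marks two partition instances
    // assumed to execute together in the same step.
    #include <cstdio>

    void printWrappedOrder(int iterations)
    {
        std::printf("prologue: P0[0] P1[0]\n");                 // fill the pipeline
        for (int n = 1; n < iterations; ++n)
            std::printf("step %d:  P2[%d]+P0[%d]  P1[%d]\n",    // wrapped steady state
                        n, n - 1, n, n);
        std::printf("epilogue: P2[%d]\n", iterations - 1);      // drain the pipeline
    }

The prologue and epilogue lines are exactly the extra code that the text notes a software-pipelined schedule must carry for the initial and final iterations.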


TABLE 1
Scheduler Test Results

                                     Heuristic schedule distribution
Graph name                  Number     Frequency   Running   Mean running   Optimal schedule   Difference between mean
                            of nodes               time      time           running time       and optimal (% of optimal)

Tracker                        9       1.0         14.75     14.75          14.75               0%
Derivatives                   10       1.0         29.49     29.49          29.49               0%
Horswill                      10       1.0         19.66     19.66          19.66               0%
Temporal filter               12       1.0         39.32     39.32          39.32               0%
Tracker 2                     12       1.0         17.41     17.41          17.41               0%
Change detect                 12       1.0         13.52     13.52          13.52               0%
Gaussian Pyr                  13       1.0         22.39     22.39          22.39               0%
Phase                         14       1.0         32.77     32.77          32.77               0%
Motion                        14       1.0         52.43     54.43          54.43               0%
Laplacian Pyr                 16       0.219       26.74     30.33          26.74              13%
                                       0.250       26.93
                                       0.271       33.07
                                       0.260       33.84
Disparity filter              24       0.490       42.60     44.3           42.60               4%
                                       0.510       45.87
2D phase                      26       0.352       39.32     41.9           39.32               6.6%
                                       0.492       42.60
                                       0.156       45.87
Motion filter                 39       0.250       31.10     34.30          31.10              10.3%
                                       0.246       31.28
                                       0.249       37.28
                                       0.255       37.43
Ad hoc scheduler breaker      14       0.125       26.21     45.43          26.21              73%
                                       0.394       39.32
                                       0.370       52.43
                                       0.111       65.54

Note. Running times in milliseconds. The first 13 entries are for the test set. The final entry is for a graph designed specifically to provoke worst-case behavior.

FIG. 7. A class of graph for which the VEIL heuristic scheduler can produce arbitrarily bad schedules. Suppose that nodes in the graph at left are incompatible if and only if they have the same label. Clearly the optimal schedule contains exactly two partitions, as shown at center. If the VEIL scheduler always examines "ready" nodes in left-to-right order, however, it will produce the schedule shown at right, in which the number of partitions (and hence the running time) is proportional to the number of nodes in the graph.

We have not done formal experiments to determine how much could be gained by redefining the scheduling problem to allow parallel pipes or "wrapped" execution order. Hand inspection of the graphs showed that allowing parallel pipes would reduce the running time of the disparity filter graph by 7.7%, but it would not affect any of the other graphs. Permitting wrapped execution order saves 5% to 10% in most cases, because the last pipe of a schedule can often be executed in parallel with the first. Permitting both parallel pipes and wrapped execution is significantly more useful, yielding savings of 20% to 40% on four of the 13 test graphs. We caution that these numbers have not been verified by implementing the schedules in question


and, hence, are not as reliable as the simulation results presented in Table 1. We do believe that they are qualitatively correct. Their overall implication is that modifying the scheduling paradigm would cut the running time of some schedules nearly in half, but it would produce only modest improvements in most cases.

The ultimate test of the VEIL scheduler would be to compare its schedules to hand-coded ImageFlow. The computational elements of the MV20 are deeply pipelined and contain complex internal datapaths that allow data to be routed between elements without going over the switch. VEIL cannot exploit this possibility because its operators are defined to accept input and produce output at switch ports. ImageFlow programmers are under no such restriction, so they should be able to improve on the VEIL scheduler in many cases. Unfortunately reimplementing the entire test set in ImageFlow would take months of full-time effort and, hence, is not practical. We do have hand-coded implementations of the Gaussian and Laplacian pyramid computations and have carefully inspected the other graphs to identify opportunities for optimization. For most graphs, including the pyramid computations, we have been unable to make any improvement beyond what is available by permitting parallel pipes and wrapped schedules. For three graphs, including the "phase" graph of Fig. 6, it appears possible to reduce the running times by 50% to 60%. These savings would be obtained by embedding sum-of-product subgraphs in the internal datapaths of the AU device. Much of the speedup could be obtained within VEIL by creating a special-purpose operator that implements the appropriate algebraic operation.

6.4. The Executive

The job of the VEIL executive is to allow programs to control and communicate with graph computations. It is implemented as a series of methods of the graph class. Most of the control operations (e.g., running or halting a graph computation) map directly to calls to the ImageFlow device driver. Communications and synchronization methods make use of the ImageFlow event mechanism, which provides semaphore-like variables that can be set, cleared, or waited for by either the application or the running PAT. User-defined callbacks are handled by storing a pointer to the callback function in the operator object and instructing the ImageFlow device driver to issue an event on completion of the pipe containing the operator. When the application invokes a blocking execution command or the asynchronous Update() command, the executive waits for the pipe events in order of issue and calls the appropriate callbacks after each pipe. Operations that set run-time modifiable attributes or retrieve data from the MV20 are executed immediately if the graph is halted. If the graph is currently executing, they are deferred until the next occurrence of the appropriate event.
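The dispatch pattern is easy to see in miniature. The sketch below models the semaphore-like events with a condition variable and shows a dispatcher that waits for pipe-completion events in order of issue and then runs the callbacks registered on that pipe; the Event and Pipe types here are illustrative stand-ins, not the ImageFlow or VEIL API.

    // Generic event/callback dispatch in the style described above.
    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <vector>

    class Event {                                  // set / clear / wait, like the events described above
    public:
        void set()   { { std::lock_guard<std::mutex> l(m_); flag_ = true; } cv_.notify_all(); }
        void clear() { std::lock_guard<std::mutex> l(m_); flag_ = false; }
        void wait()  { std::unique_lock<std::mutex> l(m_); cv_.wait(l, [this]{ return flag_; }); }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        bool flag_ = false;
    };

    struct Pipe {
        Event done;                                    // raised by the device driver on completion
        std::vector<std::function<void()>> callbacks;  // user callbacks on operators in this pipe
    };

    // Wait for the pipes in order of issue and run the callbacks after each one,
    // as the executive does for blocking execution or Update().  A deque is used
    // because Pipe holds a mutex and is therefore not movable.
    void dispatchCallbacks(std::deque<Pipe>& pipes)
    {
        for (Pipe& p : pipes) {
            p.done.wait();
            for (auto& cb : p.callbacks) cb();
            p.done.clear();
        }
    }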

Executive Limitations

The current version of VEIL does not halt the MV20 during data transfers or execution of user callback functions. This makes the running time of a graph deterministic, which is an advantage in many real-time vision applications. However, it also makes it possible for applications that communicate with a running graph to receive corrupt or inconsistent data, or for attribute changes to take effect in the middle of a computation rather than between iterations. This may occur if the host's UNIX scheduler fails to run the user process in a timely fashion, or if the user code fails to complete its task quickly. Whenever possible we prefer to build robust applications that can tolerate errors of this kind. For nonrobust applications the blocking Run() execution command can be used to enforce mutual exclusion.

The Run() command allows programs to synchronize reliably at graph boundaries. In some cases, however, this is inadequate. Consider a program that contains two or more STATISTICS operators. Because there is only a single statistics unit on the MV20, each operator will execute in a separate pipe. In order to obtain valid data from the first operator, the application must read the data after the operator runs, but before the hardware statistics buffer is cleared in preparation for executing the second operator. This level of service cannot be guaranteed by a user process under UNIX. At present the only solution to this problem is to put each statistics operator into a separate graph and use Run() repeatedly, which is awkward and unintuitive. We are currently testing an experimental version of the executive that uses events to guarantee exclusive access to data-gathering operators of this type.

7. CONCLUSION

VEIL and MERLIN provide a powerful environment for developing real-time vision systems. VEIL's coarse-grained dataflow model allows the programmer to concentrate on the image processing task at hand, rather than the details of resource management, scheduling, and synchronization. It also provides extensive facilities for interaction and synchronization between the image processor and programs running on the host, making it easy to embed VEIL computations into robot or autonomous vehicle control programs. The VEIL scheduler produces efficient execution schedules that compare well with hand-coded ImageFlow. The MERLIN interface allows VEIL graphs to be constructed and modified interactively, supporting exploratory programming.

VEIL has a number of limitations, some of which are correctable and others of which are fundamental consequences of the design. As we observed in the previous section, the scheduler ignores potentially useful information.


The underlying computational model is sometimes a problem as well: computations that cannot be expressed as HSDF graphs can be implemented by "escaping" to C++ code, but the mechanism is awkward and makes programs difficult to maintain. By far the most serious problem we encounter in using VEIL stems from the fact that VEIL applications run as user processes under UNIX. This places them at the mercy of the UNIX scheduler and makes it impossible to write programs that rely on meeting hard real-time constraints.

One of the secondary design criteria for VEIL was that it should have the potential to be ported to other image processing architectures. The portions of VEIL that are directly visible to applications programs (including MERLIN) are largely device-independent and could be modified for use with another architecture fairly easily. The code that implements the individual VEIL operators, on the other hand, is intimately tied to the MV20 architecture. Fortunately these machine dependencies are well localized and could be rewritten for a new architecture with little effect on the rest of the system. The least portable parts of the system are the scheduler and executive, which would need to be completely redesigned in order to work with a different architecture. It would not be surprising if the operations provided by the VEIL executive turned out to be inappropriate for some architectures, particularly those that have their own CPUs and rely less on the host for general-purpose computation.

Current work on VEIL is directed toward improving its usability, particularly for users outside our laboratory. VEIL can now be used with the MV20's successor, the MV200. It can also be used in configurations containing multiple MV200 boards, although it requires the user to specify on which board each operator should execute. MERLIN is being rewritten using Tcl/Tk [35] in order to increase its speed and make it easier for remote users to compile. We continue to work on the scheduler and to add new operator types as the need arises.

We are currently using VEIL on a daily basis in support of our robotics work and have distributed it to more than 40 remote sites. We believe that it has more than met the goals we established for it when we began the project. New users can indeed become productive within hours, and experienced users feel little motivation to bypass it. It has largely removed development time as a consideration in deciding whether to implement a new algorithm on the MV20 or on a conventional workstation. Instead, the decision tends to be driven (as it should be) by the intrinsic strengths and weaknesses of the machine, e.g., its orientation toward pixel processing and its lack of floating-point arithmetic.

Current versions of VEIL and MERLIN are available for anonymous ftp at ftp.cs.virginia.edu in directory pub/veil, or via world-wide web at http://www.cs.virginia.edu/pvision. We hope that VEIL will be of use to the vision research community and that it will help to establish a higher standard for software tools for the next generation of high-speed image processors.

ACKNOWLEDGMENTS

The authors acknowledge the contributions of present and past members of the UVA Computer Vision group, particularly Worthy Martin, Frank Brill, Shawn Carnell, Scott Gietler, Glenn Wasson, and Jennifer Wong. We are also grateful to VEIL users around the world for providing valuable feedback and bug fixes, and to the technical support engineers at Datacube for helping us understand ImageFlow. MaxVideo, ImageFlow, and MV20 are trademarks of Datacube, Inc. This work was supported in part by the Defense Advanced Research Projects Agency under Grant N00014-94-1-0841 and in part by the National Science Foundation under Grant CDA-8922545-01.

REFERENCES

1. W. Ackerman, Data flow languages, IEEE Comput. 15, No. 2, 1982, 15–25.
2. A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison–Wesley, Reading, MA, 1986.
3. J. Aloimonos, I. Weiss, and A. Bandyopadhyay, Active vision, in Proceedings, 1st International Conference on Computer Vision, London, 1987, pp. 35–54.
4. Amerinex Artificial Intelligence, Inc., KBVision Programmers Manual, Amerinex Artificial Intelligence, Amherst, MA, 1987.
5. M. Annaratone, E. Arnould, T. Gross, H. T. Kung, M. Lam, O. Menzilcioglu, and J. A. Webb, The Warp computer: Architecture, implementation, and performance, IEEE Trans. Comput. C-36, No. 12, 1987, 1523–1538.
6. R. G. Babb II, Parallel processing with large-grain data flow techniques, IEEE Comput. 17, No. 7, 1984, 55–61.
7. D. F. Bacon, S. L. Graham, and O. J. Sharp, Compiler transformations for high-performance computing, ACM Comput. Surveys 26, No. 4, 1994, 345–420.
8. R. Bajcsy, Active perception, Proc. IEEE 76, No. 8, 1988, 996–1005.
9. D. H. Ballard, Reference frames for animate vision, in Eleventh International Joint Conference on Artificial Intelligence, Detroit, August 1989, pp. 1635–1641.
10. A. Biancardi and A. Rubini, PACCO—A new approach for an effective I. P. environment, in Proceedings, 12th IAPR Int'l Conf. on Pattern Recognition, Jerusalem, October 1994, pp. 395–398.
11. R. Blanford and S. Tanimoto, The PyramidCalc system for research in pyramid machine algorithms, in Proceedings, Second International Workshop on Visual Languages, Dallas, June 1986, pp. 138–142.
12. J. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt, Ptolemy: A framework for simulating and prototyping heterogeneous systems, Int. Comput. Simul. 4, 1994, 155–182.
13. P. Burt and E. Adelson, The Laplacian pyramid as a compact image code, IEEE Trans. Commun. 31, No. 4, 1983, 532–540.
14. P. Burt, J. Bergen, R. Hingorani, R. Kolczynski, W. Lee, A. Leung, J. Lubin, and H. Shvaytser, Object tracking with a moving camera, in Proceedings, IEEE Workshop on Visual Motion, Irvine, March 1988, pp. 2–12.
15. D. Coombs, I. Horswill, and P. von Kaenel, Disparity filtering: Proximity detection and segmentation, in Proceedings, SPIE Intelligent Robots and Computer Vision XI: Algorithms, Techniques, and Active Vision, Boston, Nov. 1992, pp. 195–206.


16. Datacube, Inc., Danvers, MA, WitFlow graphical programming tool, High Performance Imaging 6, No. 1, 1994.
17. Datacube, Inc., MaxVideo 20 Hardware Reference Manual, Datacube, Danvers, MA, 1991.
18. Datacube, Inc., ImageFlow Reference Manual, Datacube, Danvers, MA, 1991.
19. M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, New York, 1979.
20. L. G. C. Hamey, J. L. Webb, and I-C. Wu, Low-level vision on WARP and the APPLY programming model, in Parallel Computation and Computers for Artificial Intelligence (J. S. Kowalik, Ed.), pp. 185–199, Kluwer Academic, Boston, 1988.
21. R. M. Haralick, Y.-H. Yao, L. G. Shapiro, I. T. Phillips, A. K. Somani, J.-N. Hwang, M. Harrington, C. Wittenbrink, C.-H. Chen, X. Liu, and S. Chen, Proteus: Control and management system, in Proceedings, CAMP '93 Workshop on Computer Architectures for Machine Perception, New Orleans, December 1993, pp. 101–108.
22. M. Hirakawa, S. Iwata, I. Yoshimoto, M. Tanaka, and T. Ichikawa, HI-VISUAL iconic programming, in Proceedings, IEEE Computer Society Workshop on Visual Languages, Linkoping, Sweden, 1987.
23. I. D. Horswill and R. A. Brooks, Situated vision in a dynamic world: Chasing objects, in Proceedings, AAAI 88, St. Paul, 1988, pp. 796–800.
24. J. M. Jagadeesh and Y. Wang, LabView, Computer 26, No. 2, 1993, 100–103.
25. H. Kasahara and S. Narita, Practical multiprocessor scheduling algorithms for efficient parallel processing, IEEE Trans. Comput. C-33, No. 11, 1984, 1023–1029.
26. E. W. Kent, M. O. Shneier, and R. Lumia, PIPE—Pipelined image processing engine, J. Parallel Distrib. Comput. 2, 1985, 50–78.
27. C. Kohl and J. Mundy, The development of the Image Understanding Environment, in Proceedings, IEEE Comput. Soc. Conf. on Computer Vision and Pattern Recognition, Seattle, 1994, pp. 443–447.
28. M. S. Landy, Y. Cohen, and G. S. Sperling, HIPS: A Unix-based image processing system, Comput. Vision Graphics Image Process. 25, No. 3, 1984, 331–347.
29. D. T. Lawton and C. C. McConnell, Image understanding environments, Proc. IEEE 76, No. 8, 1988, 1036–1050.
30. E. A. Lee, W.-H. Ho, E. Goei, J. Bier, and S. Bhattacharyya, Gabriel: A design environment for DSP, IEEE Trans. Acoust. Speech Signal Process., Nov. 1989.
31. E. A. Lee and D. G. Messerschmitt, Static scheduling of synchronous data flow programs for digital signal processing, IEEE Trans. Comput. C-36, No. 1, 1987, 24–35.
32. E. A. Lee and D. G. Messerschmitt, Synchronous data flow, Proc. IEEE 75, No. 9, 1987, 1235–1245.
33. R. J. Lockwood, Heuristic Scheduling in VEIL, M.S. thesis, Department of Computer Science, University of Virginia, Charlottesville, May 1994.
34. T. J. Olson, N. G. Klop, M. R. Hyett, and S. M. Carnell, MAVIS: A visual environment for active computer vision, in Proceedings, IEEE Workshop on Visual Languages (VL'92), Seattle, September 1992, pp. 170–176.
35. J. K. Ousterhout, Tcl and the Tk Toolkit, Addison–Wesley, Reading, MA, 1994.
36. R. Pausch, N. Young III, and R. Deline, SUIT, the PASCAL of user interface toolkits, in Proceedings of UIST: the Annual ACM SIGGRAPH Symposium on User Interface Software and Technology, November 1991.
37. A. R. Pope and D. G. Lowe, Vista: A software environment for computer vision research, in Proceedings, IEEE Comput. Soc. Conf. on Computer Vision and Pattern Recognition, Seattle, 1994, pp. 768–772.
38. J. R. Rasure and C. S. Williams, An integrated data flow visual language and software development environment, J. Visual Lang. Comput. 2, 1991, 217–246.
39. V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, Cambridge, 1989.
40. L. G. Shapiro, R. M. Haralick, and M. J. Goulish, INSIGHT: A dataflow language for programming vision algorithms in a reconfigurable computational network, Int. J. Pattern Recognit. Artif. Intell. 1, No. 3, 1987, 335–350.
41. G. C. Sih and E. A. Lee, A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures, IEEE Trans. Parallel Distrib. Systems 4, No. 2, 1993, 175–187.
42. Silicon Graphics Computer Systems, Inc., IRIS Explorer User's Guide, Document 007-1371-010, Silicon Graphics Computer Systems, Mountain View, CA, 1992.
43. C. V. Stewart and C. R. Dyer, Heuristic scheduling algorithms for PIPE, in Proceedings, IEEE Workshop on Comp. Arch. for Pattern Analysis and Machine Intelligence, October 1987.
44. M. J. Swain and M. Stricker (Eds.), Promising directions in active vision (written by the attendees of the NSF Active Vision Workshop, August 1991), University of Chicago Technical Report CS 91-27, November 1991.
45. J. R. Taylor, R. J. Lockwood, T. J. Olson, and S. A. Gietler, A visual programming system for pipelined image processors, in SPIE Machine Vision Applications, Architectures, and Systems Integration, Boston, Nov. 1992, pp. 92–101.
46. S. L. Tanimoto, VIVA: A visual language for image processing, J. Visual Lang. Comput. 1, No. 2, 1990, 127–139.
47. D. G. Tilley, ZEBRA for MaxVideo: An application of object oriented microprogramming to register level devices, Department of Computer Science Tech. Rep. 315, University of Rochester, 1990.
48. C. Upson, T. Faulhaber, D. Kamins, D. Laidlaw, D. Schlegel, J. Vroom, R. Gurwitz, and A. van Dam, The application visualization system: A computational environment for scientific visualization, IEEE Comput. Graphics Appl., July 1989, 30–42.
49. J. Webb, Steps toward architecture-independent image processing, IEEE Comput. 25, No. 2, 1992, 21–31.
50. G. Wolberg, IMPROC: An interactive image processing software package, Technical Report CUCS-330-88, Department of Computer Science, Columbia University, 1988.
51. T. Yang and A. Gerasoulis, DSC: Scheduling parallel tasks on an unbounded number of processors, IEEE Trans. Parallel Distrib. Systems 5, No. 9, 1994, 951–967.
52. M. Zeltner, B. Schneuwly, and W. Guggenbuhl, How to program and configure a heterogeneous multiprocessor, in Proceedings, 1991 Workshop on Comp. Arch. for Machine Perception, Paris, December 1991, pp. 111–122.