Archives Des Sciences, Vol. 65, No. 12, Dec 2012. ISSN 1661-464X

HW/SW Co-design for FPGA based Video Processing Platform

Yahia SAID (Corresponding author)

Laboratory of Electronics and Microelectronics (EμE)

Faculty of Sciences of Monastir,

University of Monastir 5019, Tunisia

E-mail: [email protected]

Taoufik SAIDANI, Wajdi ELHAMZI, Mohamed ATRI

Laboratory of Electronics and Microelectronics (EμE)

Faculty of Sciences of Monastir,

University of Monastir 5019, Tunisia

E-mail: [email protected], [email protected], [email protected]

Abstract

In this paper, we present a Video Processing Platform (VPP) for rapid prototyping based on an FPGA (Field Programmable Gate Array) architecture, using an EDK embedded system and Xilinx System Generator. This hardware/software co-design platform has been implemented on a Xilinx Spartan-3A DSP FPGA. The video interface blocks are implemented in RTL, and the MicroBlaze soft processor is used as an embedded video controller. This paper discusses the architectural building blocks, showing the flexibility of the proposed platform. This flexibility is achieved by using a new design flow based on Xilinx System Generator. The Video Processing Platform allows custom processing blocks to be plugged into the platform architecture without modifying the front-end (capturing video data) or back-end (displaying processed output). This paper presents several examples of video processing applications, such as a Prewitt edge detector and video wavelet coding, that have been realized using the Video Processing Platform (VPP) for real-time video processing.

Keywords: Field Programmable Gate Arrays (FPGA), Real Time Video Processing, Embedded

Development Kit (EDK), System Generator (XSG).

1. Introduction

Image and video processing are ever-expanding and dynamic areas, with applications reaching into our everyday life in fields such as medicine, astronomy, ultrasonic imaging, remote sensing, space exploration, surveillance, authentication, automated industrial inspection, and many more [1].

Reconfigurable hardware in the form of Field Programmable Gate Arrays (FPGAs) has been proposed as a way of obtaining high performance for image processing, even under real-time requirements [2]. Implementing image processing algorithms on reconfigurable hardware minimizes time-to-market cost, enables rapid prototyping of complex algorithms, and simplifies debugging and verification. Therefore, FPGAs are an ideal choice for the implementation of real-time image processing algorithms [3].


With the evolution of FPGA architecture, modern devices embed processors for designing reconfigurable embedded systems. The design involves the use of a processor and hardware logic IP, and their integration; this is termed System-on-Chip (SoC) design [4].

The Xilinx Embedded Development Kit (EDK) is offered as an SoC design platform. It provides a rich set of tools, such as the Software Development Kit (SDK) for developing embedded software applications and Xilinx Platform Studio (XPS) for hardware development, together with a wide range of embedded processing Intellectual Property (IP) cores, including processors and peripherals. Integrating all the cores with processors inside the FPGA leads to a reconfigurable embedded processor system [5].

The introduction of high-level hardware system modeling tools has further accelerated the design of image processing systems on FPGAs. Xilinx System Generator (XSG) offers a new design methodology that uses a model-based approach for the design and implementation of Digital Signal Processing (DSP) applications on FPGAs [6].

XSG is an important design tool that extends Simulink with a Simulink library called the Xilinx blockset, whose blocks can be mapped directly into target FPGA hardware. XSG provides the functionality for performing co-simulation of designs that run both in hardware and in software, which makes it possible to complete even very long simulations within a much shorter period of time [6]. Figure 1 shows the design flow using XSG.

The software automatically converts the high-level system block diagram to RTL. The result can be synthesized to Xilinx FPGA technology using the ISE tools. All of the downstream FPGA implementation steps, including synthesis and place-and-route, are performed automatically to generate an FPGA programming file.

Figure 1. XSG based design flow for hardware implementation


System Generator provides a system integration platform for the design of video processing systems on FPGAs that allows the RTL, Simulink, MATLAB, and C/C++ components of a DSP system to come together in a single simulation and implementation environment. It also supports a black box block that allows RTL to be imported into Simulink and co-simulated.

System Generator constructs the VHDL design of the model, generates a pcore for it, and integrates the pcore with the hardware/software platform in the XPS project. The EDK Processor IP block provides an interface between MicroBlaze and the custom logic being developed in XSG. In this approach, the export-IP-core technique is used for designing the SoC system [6].

The Xilinx Embedded Development Kit (EDK) tools make it possible to implement a complete video processing system on a single FPGA using hardware/software co-design methods. In this approach, custom image/video processing modules developed in System Generator can be integrated as dedicated hardware peripherals into the existing framework.

The objective of this work is to develop a real-time video processing platform (VPP) with input from a CMOS camera and output to a DVI display, and to verify the resulting video in real time. This platform enables rapid development of image and video processing algorithms: model-based designs developed with XSG are converted to hardware blocks that can be incorporated easily into the VPP.

This paper is organized as follows. Section 2 gives an overview of the platform design. Section 3 presents two examples of video processing applications developed with XSG: a Prewitt edge detector and video wavelet coding. Finally, a brief conclusion and directions for future work are given in Section 4.

2. Overall Platform Design

The board used for the VPP is the VSK Spartan-3A DSP platform developed by Xilinx [7]. This board carries a Xilinx Spartan-3A DSP XC3SD3400A-4FGG676C FPGA with 53,712 logic cells, 126 DSP48A slices, and 2,268 Kb of block RAM (BRAM).

The board has an add-on card, the FMC-Video I/O daughter card, which augments the video capabilities of the Video Processing Platform. The FMC-Video card includes a camera interface that allows the capture of data from a custom camera based on a Micron MT9V022 digital CMOS color image sensor [8].

Images with 8 or 10 bits per pixel, 742H by 480V, at 60 frames per second are captured by the high-performance MT9V022 image sensor's 10-bit A/D converter and serialized for transmission [9].

The data stream from the camera is a high-speed LVDS stream. It is received and deserialized using a National DS92LV1212A deserializer, which is capable of carrying LVDS data from a camera with a pixel rate of 26.6 MHz [8].
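As a rough consistency check (our own back-of-the-envelope arithmetic, not a figure taken from the data sheets), the active pixel throughput implied by the quoted format is 742 \times 480 \times 60 \approx 2.14 \times 10^{7} pixels per second, i.e. about 21.4 MHz, which fits within the 26.6 MHz camera pixel rate once horizontal and vertical blanking intervals are accounted for.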

This board is ideal for a video processing platform since it has all the hardware necessary to capture video data and display them on a monitor. Video data are captured from the camera at a resolution of 742x480 progressive at 60 Hz. These data are sent through a Gamma block for correction and then on to the Video-to-VFBC block, so that only the active data are sent into the Multi-Port Memory Controller (MPMC). The default configuration is a 3-frame buffer, with a simple sync signal connected between the Video-to-VFBC block and the Display Controller to ensure that the frame read out of external memory is always one frame behind the frame being written. The Display Controller then reads the data out of memory and passes them to the DVI output.
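To make the frame-buffer handshake described above concrete, the following is a minimal C sketch of triple-buffer bookkeeping in which the display read pointer always trails the capture write pointer by one frame. The structure and function names are our own illustration, not code from the VSK reference design.

#include <stdint.h>

/* Hypothetical triple-buffer bookkeeping: the display (reader) always
 * consumes the frame completed one step behind the capture (writer). */
typedef struct {
    uint32_t base_addr[3];   /* start address of each frame in external memory    */
    int      write_idx;      /* buffer currently being filled by the camera path  */
    int      read_idx;       /* buffer currently being scanned out to the display */
} frame_buffer_t;

/* Called when the capture side signals end-of-frame (the "sync" event). */
static void on_frame_written(frame_buffer_t *fb)
{
    fb->read_idx  = fb->write_idx;            /* display the frame just completed  */
    fb->write_idx = (fb->write_idx + 1) % 3;  /* capture moves on to the next slot */
}

/* Address the display controller should fetch from for the current frame. */
static uint32_t current_display_frame(const frame_buffer_t *fb)
{
    return fb->base_addr[fb->read_idx];
}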


We have built a flexible architecture that enables real-time image and video processing. An overview of the design is given in Figure 2.

Figure 2. Platform design overview

The complete streaming video application includes video interfaces, run-time configurable processing blocks, and a real-time video processing block. The system is controlled by a MicroBlaze processor [10] that initializes the VPP peripherals and controls the video processing and frame buffer pipelines by reading and writing control registers in the system.
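From the MicroBlaze software side, such control registers are driven with plain memory-mapped reads and writes. The sketch below uses the Xil_In32/Xil_Out32 helpers from the Xilinx standalone BSP; the peripheral base address, register offsets, and bit definitions are hypothetical placeholders, not the actual register map of the VPP peripherals.

#include "xil_io.h"   /* Xil_In32 / Xil_Out32 from the Xilinx standalone BSP */

/* Hypothetical register map of a video-pipeline peripheral on the PLB.
 * The base address, offsets, and bit fields below are placeholders for
 * illustration only. */
#define VIDEO_PIPE_BASEADDR  0x84000000U   /* placeholder base address   */
#define REG_CONTROL          0x00U         /* bit 0: enable the pipeline */
#define REG_STATUS           0x04U         /* bit 0: frame-done flag     */
#define CTRL_ENABLE          0x00000001U

static void video_pipe_start(void)
{
    Xil_Out32(VIDEO_PIPE_BASEADDR + REG_CONTROL, CTRL_ENABLE);
}

static int video_pipe_frame_done(void)
{
    return (Xil_In32(VIDEO_PIPE_BASEADDR + REG_STATUS) & 0x1U) != 0;
}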

The MicroBlaze soft processor core is a 32-bit Harvard Reduced Instruction Set Computer (RISC)

architecture optimized for implementation in Xilinx FPGAs with separate 32-bit instruction and data buses

running at full speed to execute programs and access data from both on-chip and external memory at the

same time [10]. It is used as an embedded video controller in this design. The block diagram of MicroBlaze

is shown in Figure 3.

The peripherals are connected to the embedded MicroBlaze processor through the Processor Local Bus (PLB). The processor is connected to dual-port SRAM, called Block RAM (BRAM), using a dedicated Local Memory Bus (LMB). This bus features separate 32-bit-wide channels for program instructions and program data, exploiting the dual-port nature of the BRAM, and provides single-cycle access to the on-chip Block RAM.


Figure 3. MicroBlaze Core Block Diagram

The complete video system is created using the Xilinx Embedded Development Kit (EDK) [5] and System Generator for DSP [6]. The Embedded Development Kit is an integrated development environment for designing embedded processing systems. System Generator is a system-level modeling tool from Xilinx that facilitates FPGA hardware design. It can automatically generate accelerator blocks, in the form of custom peripherals for the embedded video application, that allow the MicroBlaze processor to read and write shared memories inside the customized video accelerators.

The synthesis results of the overall system are given in Table 1. VPP uses few resources of the FPGA;

hence space is available for additional logic such as image and video processing applications.

Table 1. The synthesis results of the proposed platform

Resource Type       Used     Available   Utilization
Slices              7810     23872       33%
Slice Flip Flops    9706     47744       20%
4-input LUTs        11170    47744       24%
Bonded IOBs         78       469         17%
BRAMs               64       126         50%
DSP48s              3        126         3%


3. Case Study Using Xilinx System Generator

Two video processing applications have been designed and developed using Xilinx System Generator: a Prewitt edge detector and a video wavelet coding block, both designed and tested with the VPP as previously described. In this section, the output images shown are real-time video results of the hardware components generated by System Generator.

3.1 Prewitt Gradient Edge Detector

Edges characterize boundaries as well as giving the information of the location objects, shape, size, and

object textures. Therefore, edge detection has a fundamental importance in image processing. Edges in

images characterize object boundaries and are therefore useful for segmentation, registration, and

identification of objects in a scene.

Edge detection refers to the process of identifying and locating sharp discontinuities in an image [11]. These discontinuities are abrupt changes in pixel intensity that characterize the boundaries of objects in a scene. The most well-known technique for edge detection involves convolving the image with a 2-D filter that is constructed to be sensitive to large gradients in the image while returning values of zero in uniform regions [12].

Prewitt is a gradient-based edge detection algorithm that performs a 2-D spatial gradient measurement on the video data. It convolves the original image with two 3x3 kernels. Hence, all of the edges in an image, regardless of direction, can be detected by summing the two directional edge-enhancement operations.

First, RGB data are converted into grayscale to obtain the image intensity, using the standard luminance weighting:

I = 0.299 R + 0.587 G + 0.114 B        (1)
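For an integer-only implementation on the FPGA or the MicroBlaze, the weighting in (1) is usually approximated in fixed point. The following C sketch illustrates one such approximation; the scaled coefficients are our own illustrative choice, not values taken from the paper's design.

#include <stdint.h>

/* Fixed-point approximation of I = 0.299 R + 0.587 G + 0.114 B.
 * The coefficients are scaled by 256 (77 + 150 + 29 = 256), so the sum of
 * the three products fits in 16 bits and a right shift by 8 recovers the
 * 8-bit intensity. */
static inline uint8_t rgb_to_gray(uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t acc = (uint16_t)(77 * r + 150 * g + 29 * b);
    return (uint8_t)(acc >> 8);
}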

The kernels are then applied separately to the image intensity to produce separate measurements of the gradient component in each orientation (called Gx and Gy), as shown in (2):

G_x = \begin{bmatrix} -1 & 0 & +1 \\ -1 & 0 & +1 \\ -1 & 0 & +1 \end{bmatrix} \quad and \quad G_y = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ +1 & +1 & +1 \end{bmatrix}        (2)

These can then be combined to find the absolute magnitude of the gradient at each point and the orientation of that gradient, as follows:

|G| = \sqrt{G_x^2 + G_y^2} \quad and \quad \theta = \arctan(G_y / G_x)        (3)
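As a behavioral reference for how the Prewitt operator in (2) and (3) applies to a streaming pixel neighborhood, the C sketch below computes one output pixel from a 3x3 grayscale window. It is our own illustration rather than the generated XSG hardware, and it uses the common |Gx| + |Gy| approximation of the gradient magnitude to avoid a square root.

#include <stdint.h>
#include <stdlib.h>   /* abs() */

/* Prewitt gradient for one output pixel, given its 3x3 grayscale
 * neighborhood p[row][col].  Behavioral reference only: the magnitude
 * |G| is approximated by |Gx| + |Gy|, a common hardware-friendly
 * substitute for sqrt(Gx^2 + Gy^2). */
static uint8_t prewitt_pixel(uint8_t p[3][3])
{
    /* Gx kernel: [-1 0 +1; -1 0 +1; -1 0 +1] */
    int gx = (p[0][2] + p[1][2] + p[2][2]) - (p[0][0] + p[1][0] + p[2][0]);
    /* Gy kernel: [-1 -1 -1; 0 0 0; +1 +1 +1] */
    int gy = (p[2][0] + p[2][1] + p[2][2]) - (p[0][0] + p[0][1] + p[0][2]);

    int mag = abs(gx) + abs(gy);
    return (uint8_t)(mag > 255 ? 255 : mag);   /* clamp to 8-bit output */
}

In the actual pipeline, the 3x3 window would be assembled by line buffers inside the XSG design; here it is simply passed in as an array.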

The Prewitt edge detector is built as a video processing accelerator using System Generator for DSP and Simulink. The design of our filter is shown in Figure 4.


Figure 4. Prewitt IP core in System Generator

The System Generator design contains an EDK Processor block that can be exported as an EDK pcore using the EDK Export Tool compilation target. The export process creates a PLB-based pcore, which is integrated with the MicroBlaze 32-bit soft RISC processor using Xilinx Platform Studio (XPS) [6].

In the VPP setup, a DVI display shows the detected edges for the camera stream. The experimental setup for the implementation of Prewitt edge detection is presented in Figure 5.

Figure 5. Experimental setup for implementation of edge detection. Input

is from CMOS camera and the output is on a DVI display.


The total resource usage for the system, including the MicroBlaze, bus structure, the Prewitt edge core, and peripherals, is 9096 slices, equaling 38% of the FPGA's total resources. Table 2 shows the amount of logic used for the Prewitt edge module alone; its post-synthesis resource usage is 5%, and its estimated post-synthesis maximum frequency is 68.432 MHz.

Table 2. Post-synthesis device utilization for the Prewitt Edge Pcore

Resource Type       Used     Available   Utilization
Slices              1286     23872       6%
Slice Flip Flops    1746     47744       4%
4-input LUTs        1710     47744       4%
Bonded IOBs         0        469         0%
BRAMs               5        126         3%
DSP48s              4        126         3%
Maximum Frequency   68.432 MHz

3.2 Discrete Wavelet Transform

Discrete Wavelet Transform (DWT) is a broadly used digital signal processing technique with application

in diverse areas such as digital speech recognition, feature extraction, multi-resolution video processing and

data compression [13]. DWT, originally implemented through Mallat’s filterbank algorithm [14], has been

rendered more efficient by the development of the lifting scheme that has been incorporated in the JPEG

2000 image compression standard.

The lifting scheme relies entirely on the spatial domain and has many advantages compared to the filter-bank structure, such as lower area, power consumption, and computational complexity. Lifting has further advantages, such as in-place computation of the DWT and integer-to-integer wavelet transforms, which are useful for lossless coding. The lifting scheme has been developed as a flexible tool suitable for constructing second-generation wavelets. It is composed of three basic operation stages: split, predict, and update (Figure 6).

Figure 6. Lifting scheme forward transform (the input samples are split, passed through the prediction and update stages, and scaled by K and 1/K)
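To make the split/predict/update structure concrete, the following is a minimal C sketch of a one-level forward transform using the reversible integer (5,3) lifting steps adopted in JPEG 2000. It is a behavioral reference written for illustration, not the XSG block used on the platform.

#include <stdio.h>

/* One-level forward 5/3 lifting DWT of a 1-D integer signal of even length n.
 * low[]  receives the n/2 approximation coefficients,
 * high[] receives the n/2 detail coefficients.
 * Boundary samples use symmetric extension; the right shifts assume the
 * compiler implements arithmetic shifts for negative values (typical). */
static void dwt53_forward_1d(const int *x, int n, int *low, int *high)
{
    int half = n / 2;

    /* Predict step: d[i] = x[2i+1] - floor((x[2i] + x[2i+2]) / 2) */
    for (int i = 0; i < half; i++) {
        int left  = x[2 * i];
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i]; /* symmetric extension */
        high[i] = x[2 * i + 1] - ((left + right) >> 1);
    }

    /* Update step: s[i] = x[2i] + floor((d[i-1] + d[i] + 2) / 4) */
    for (int i = 0; i < half; i++) {
        int dl = (i > 0) ? high[i - 1] : high[i];               /* symmetric extension */
        low[i] = x[2 * i] + ((dl + high[i] + 2) >> 2);
    }
}

int main(void)
{
    int x[8] = { 10, 12, 14, 13, 11, 9, 8, 8 };
    int low[4], high[4];

    dwt53_forward_1d(x, 8, low, high);
    for (int i = 0; i < 4; i++)
        printf("L[%d] = %d   H[%d] = %d\n", i, low[i], i, high[i]);
    return 0;
}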


The lifting-scheme implementation is decomposed into a two-level 2-D DWT, which may be computed using filter banks as shown in Figure 7. The input samples X(n) are passed through two stages of analysis filters.

They are first processed by low-pass (h(n)) and high-pass (g(n)) horizontal filters and sub-sampled by two. Subsequently, the outputs (L1, H1) are processed by low-pass and high-pass vertical filters. Note that L1 and H1 are the outputs of the 1-D DWT, while LL1, LH1, HL1, and HH1 form the one-level decomposition of the 2-D DWT. From this structure, a separable 2-D DWT with N levels of transformation can easily be obtained by concatenating 1-D DWT units, with the first stage processing N transformation levels on the rows and the second stage processing N transformation levels on the columns. For image compression purposes, JPEG 2000 recommends an alternating row/column-based structure such as the one presented in Figure 7. The sub-band decomposition of an image with the standard two-level 2-D DWT is presented in Figure 8, where "H" and "L" correspond to high-pass and low-pass filter stages, respectively.

Figure 7. Lifting scheme decomposition of 5/3 filter


Figure 8. Subband decomposition for two-level 2D-DWT
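Following the row/column structure just described, the sketch below repeats a compact version of the 1-D (5,3) lifting step and applies it first to every row and then to every column of a small image, leaving the LL, HL, LH, and HH sub-bands of one decomposition level in the four quadrants. The image size, buffer layout, and function names are our own illustrative choices.

#include <stdio.h>

#define W 8   /* illustrative image width  (must be even) */
#define H 8   /* illustrative image height (must be even) */

/* One-level forward (5,3) lifting DWT of a length-n vector (n even).
 * The first half of out[] receives the low-pass band, the second half
 * the high-pass band.  Symmetric extension is used at the boundaries. */
static void dwt53_1d(const int *in, int n, int *out)
{
    int half = n / 2;
    int *lo = out, *hi = out + half;

    for (int i = 0; i < half; i++) {                              /* predict */
        int right = (2 * i + 2 < n) ? in[2 * i + 2] : in[2 * i];
        hi[i] = in[2 * i + 1] - ((in[2 * i] + right) >> 1);
    }
    for (int i = 0; i < half; i++) {                              /* update  */
        int dl = (i > 0) ? hi[i - 1] : hi[i];
        lo[i] = in[2 * i] + ((dl + hi[i] + 2) >> 2);
    }
}

/* One decomposition level of the separable 2-D DWT: a horizontal pass over
 * the rows, then a vertical pass over the columns.  Afterwards the top-left
 * quadrant of img holds the LL band, the bottom-right quadrant the HH band,
 * and the remaining quadrants the mixed low/high bands. */
static void dwt53_2d_level(int img[H][W])
{
    int line[W > H ? W : H];
    int out[W > H ? W : H];

    for (int r = 0; r < H; r++) {                 /* rows */
        dwt53_1d(img[r], W, out);
        for (int c = 0; c < W; c++) img[r][c] = out[c];
    }
    for (int c = 0; c < W; c++) {                 /* columns */
        for (int r = 0; r < H; r++) line[r] = img[r][c];
        dwt53_1d(line, H, out);
        for (int r = 0; r < H; r++) img[r][c] = out[r];
    }
}

int main(void)
{
    int img[H][W];
    for (int r = 0; r < H; r++)
        for (int c = 0; c < W; c++)
            img[r][c] = 10 * r + c;               /* simple test pattern */

    dwt53_2d_level(img);

    for (int r = 0; r < H / 2; r++) {             /* print the LL quadrant */
        for (int c = 0; c < W / 2; c++) printf("%5d", img[r][c]);
        printf("\n");
    }
    return 0;
}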


The design of the DWT2D codec in System Generator is shown in Figure 9. Experimental results of the DWT2D codec implementation are presented in Figure 10.

Figure 9. DWT2D IP core in System Generator

Figure 10. Experimental setup for implementation of DWT2D Codec

The total resource usage for the system, including the MicroBlaze, bus structure, the DWT2D codec pcore, and peripherals, is 8833 slices, equaling 38% of the FPGA's total resources. Table 3 shows the amount of logic used for the DWT2D codec module alone; its post-synthesis resource usage is 5%, and its estimated post-synthesis maximum frequency is 65.167 MHz.


Table 3. Post-synthesis device utilization for the DWT2D Codec Pcore

Resource Type       Used     Available   Utilization
Slices              1023     23872       5%
Slice Flip Flops    1246     47744       3%
4-input LUTs        1323     47744       3%
Bonded IOBs         162      469         4%
BRAMs               3        126         3%
DSP48s              4        126         4%
Maximum Frequency   65.167 MHz

4. Conclusion

Continual growth in the size and functionality of FPGAs over recent years has resulted in an increasing

interest in their use as implementation platforms for image processing applications, particularly real-time

video processing [15].

In this work, we have presented a video processing platform (VPP) for real-time video processing applications. This platform provides a development environment that allows designers to quickly begin experimenting with video processing using the Spartan-3A DSP family of FPGAs. An embedded base system, shipped with the VSK [7], provides a familiar starting point from which existing processor-based video applications can be ported or new designs created. The user can build flexible video processing systems that include embedded processors and customized video accelerators, and can verify video hardware designs in a fraction of the time using the hardware co-simulation provided by System Generator.

Two applications have been presented, showing the performance and flexibility of the proposed platform. For the Prewitt edge detection system architecture, including the MicroBlaze, bus structure, the Prewitt edge core, and peripherals, the total resource usage is 9096 slices, equaling 38% of the FPGA's total resources, with an estimated post-synthesis maximum frequency of 88.547 MHz. The DWT2D codec system architecture has a maximum frequency of 85.292 MHz and uses 8833 CLB slices (38% utilization), so there is room to implement additional parallel processing on the same platform.

The Xilinx System Generator tool offers an efficient and straightforward method for transitioning from a PC-based model in Simulink to a real-time FPGA-based hardware implementation. Custom video accelerator blocks are captured in the DSP-friendly Simulink modeling environment, converted into custom peripherals for Platform Studio, and then connected to the embedded system over the Processor Local Bus.

Future work includes using the Xilinx System Generator and EDK development tools to implement a computer vision application, an object detection and tracking system, on the proposed platform.


References

[1] J. C. Russ, The Image Processing Handbook, Sixth Edition, CRC Press, 2011.
[2] D. Crookes, "Design and implementation of a high level programming environment for FPGA-based image processing," IEEE Proceedings on Vision, Image, and Signal Processing, vol. 4, 2000.
[3] D. V. Rao, S. Patil, N. A. Muthukuma, "Implementation and Evaluation of Image Processing Algorithms on Reconfigurable Architecture using C-based Hardware Descriptive Languages," International Journal of Theoretical and Applied Computer Sciences, pp. 9-34, 2006.
[4] R. Peesapati, S. Sabat, K. Venu, "Automatic IP Core Generation in SoC," International Journal of Recent Trends in Engineering, vol. 2, no. 6, 2009.
[5] Xilinx Inc., Embedded System Tools Reference Manual, http://www.xilinx.com
[6] Xilinx Inc., System Generator User Guide, http://www.xilinx.com
[7] Xilinx Inc., Spartan-3A DSP FPGA Video Starter Kit User Guide, http://www.xilinx.com
[8] Xilinx Inc., Xtreme DSP Solution FMC-Video Daughter Board Technical Reference Guide, http://www.xilinx.com
[9] Micron, MT9V022 CMOS Image Sensor Product Brief, http://www.micron.com
[10] Xilinx Inc., MicroBlaze Soft Processor, http://www.xilinx.com
[11] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679-698, Jun. 1986.
[12] S. Behera, M. N. Mohanty, S. Patnaik, "A Comparative Analysis on Edge Detection of Colloid Cyst: A Medical Imaging Approach," Soft Computing Techniques in Vision Science, Studies in Computational Intelligence, vol. 395, Springer, pp. 63-85, 2012.
[13] D. S. Taubman, M. W. Marcellin, JPEG2000: Image Compression Fundamentals, Standards and Practice, Kluwer Academic Publishers, ch. 6, 2002.
[14] S. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, pp. 674-693, 1989.
[15] B. Hutchings, J. Villasenor, "The Flexibility of Configurable Computing," IEEE Signal Processing Magazine, vol. 15, pp. 67-84, 1998.