Image Compression Dwt Project Report

Modified DA based DWT-IDWT on FPGA for Image Compression

Dept of E&C, Sir MVIT, Bengaluru Page 1

CHAPTER 1

INTRODUCTION



1.1 IMAGE

An image (from Latin imago) is an artifact, for example a two-dimensional picture, that
has a similar appearance to some subject, usually a physical object or a person.

Images may be two-dimensional, such as a photograph or screen display, or
three-dimensional, such as a statue. They may be captured by optical devices such as
cameras, mirrors, lenses, telescopes and microscopes, and by natural objects and
phenomena, such as the human eye or water surfaces.

The word image is also used in the broader sense of any two-dimensional figure such as a
map, a graph, a pie chart, or an abstract painting. In this wider sense, images can also be
rendered manually, such as by drawing, painting or carving; rendered automatically by
printing or computer graphics technology; or developed by a combination of methods,
especially in a pseudo-photograph.

A volatile image is one that exists only for a short period of time. This may be a

reflection of an object by a mirror, a projection of a camera obscura, or a scene displayed

on a cathode ray tube. A fixed image, also called a hard copy, is one that has been

recorded on a material object, such as paper or textile by photography or digital

processes[1].

1.1.1 STILL IMAGE

A still image is a single static image, as distinguished from a moving image (see

below). This phrase is used in photography, visual media and the computer industry to

emphasize that one is not talking about movies, or in very precise or pedantic technical

writing such as a standard. A film still is a photograph taken on the set of a movie or

television program during production, used for promotional purposes.




Figure 1.1 : Still Image.

1.1.2 MOVING IMAGE

A moving image is typically a movie (film), or video, including digital video. It

could also be an animated display such as a zoetrope.

1.1.3 IMAGE FILE SIZE

Image file size, expressed as the number of bytes, increases with the number of pixels
composing an image and the colour depth of the pixels.

The greater the number of rows and columns, the greater the image resolution, and the
larger the file. Also, each pixel of an image increases in size when its colour depth
increases: an 8-bit pixel (1 byte) stores 256 colours, and a 24-bit pixel (3 bytes) stores
about 16 million colours, the latter known as true colour[1].

1.2 IMAGE COMPRESSION

Image compression, the art and science of reducing the amount of data required to
represent an image, is one of the most useful and commercially successful technologies
in the field of digital image processing. The number of images that are compressed and
decompressed daily is staggering, and the compressions and decompressions are virtually
invisible to the user. Anyone who owns a digital camera, surfs the web, or watches the
latest Hollywood movies on digital video disks (DVDs) benefits from the algorithms and
standards discussed in this section[2].

Compression is basically of two types:

Lossy Compression

Lossless Compression.




Lossy compression of data concedes a certain loss of accuracy in exchange for greatly
increased compression. An image reconstructed following lossy compression contains
degradation relative to the original. Often this is because the compression scheme
completely discards redundant information. Under normal viewing conditions no visible
loss is perceived. It proves effective when applied to graphics images and digitized
voice.

Lossless compression consists of those techniques guaranteed to generate an exact
duplicate of the input data stream after a compress/expand cycle. Here the reconstructed
image after compression is numerically identical to the original image. Lossless
compression can only achieve a modest amount of compression. This is the type of
compression used when storing database records, spreadsheets or word processing
files[2].

1.2.1 NEED FOR THE COMPRESSION

To better understand the need for compact image representations, consider the amount of
data required to represent a two-hour standard definition (SD) television movie using
720*480*24 bit pixel arrays. A digital movie (or video) is a sequence of video frames in
which each frame is a full-color still image. Because video players must display the
frames sequentially at rates near 30 fps (frames per second), SD digital video data must
be accessed at

(30 frames/sec)*(720*480 pixels/frame)*(3 bytes/pixel) = 31,104,000 bytes/sec

and a two-hour movie consists of

(31,104,000 bytes/sec)*(3600 sec/hour)*(2 hours) = 2.24*10^11 bytes

or 224 GB (gigabytes) of data. Twenty-seven 8.5 GB dual-layer DVDs (assuming
conventional 12 cm disks) are needed to store it. To put a two-hour movie on a single
DVD, each frame must be compressed, on average, by a factor of 26.3. The compression
must be even higher for High Definition (HD) television, where image resolutions reach
1920*1080*24 bits/image[1].
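The arithmetic above can be reproduced with a few lines of Python. This is only a sketch that restates the figures quoted in the text; the variable names are illustrative, and the DVD capacity is assumed to be 8.5 decimal gigabytes (8.5 * 10^9 bytes).

```python
import math

# Figures quoted in the text for uncompressed SD video
WIDTH, HEIGHT = 720, 480      # pixels per frame
BYTES_PER_PIXEL = 3           # 24-bit colour
FPS = 30                      # frames per second
MOVIE_SECONDS = 2 * 3600      # two-hour movie
DVD_BYTES = 8.5e9             # assumed: dual-layer DVD, decimal gigabytes

bytes_per_second = FPS * WIDTH * HEIGHT * BYTES_PER_PIXEL
movie_bytes = bytes_per_second * MOVIE_SECONDS

# Ratio of raw size to one disc: the average compression factor needed
compression_factor = movie_bytes / DVD_BYTES
dvds_needed = math.ceil(compression_factor)

print(f"data rate : {bytes_per_second:,} bytes/sec")
print(f"movie size: {movie_bytes:.3e} bytes")
print(f"DVDs      : {dvds_needed}, compression factor: {compression_factor:.1f}")
```

Changing WIDTH and HEIGHT to 1920 and 1080 shows how much larger the required compression factor becomes for HD material.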



1.3 OVERVIEW

Figure 1.2 Block Diagram

1.3.1 Experimental Setup

Figure 1.3: Experimental Setup



1.3.2 RESOURCES USED:

Xilinx ISE

MATLAB

Virtex-II Pro FPGA Development Kit

Desktop PC

Interfacing Model

1.3.3 OBJECTIVE

1) To carry out literature survey on

a) Image and Image Compression

b) Need for Compression

c) JPEG Standard

d) DWT-IDWT

e) DA Arithmetic

f) Real Time Setup for Image Compression

2) To develop system level block diagram for Image Compression and DWT-IDWT

processor

3) To develop software reference level for Image Compression and analyse the results

for multiple test images

4) To design and implement DA DWT-IDWT processor and analyze its performance

w.r.t area, time and power on FPGA

5) To design Modified DA DWT-IDWT processor and analyze its performance

6) To implement the proposed architecture on FPGA and verify the results in real time

experimental setup

1.4 APPLICATIONS:

Although the Fourier transform has long been the mainstay of transform-based digital
signal processing, a more recent transformation, called the wavelet transform, is making
strides in DSP applications because of some of its unique advantages.

Wavelets have their energy concentrated in time. Sinusoids (Fourier Transform) are
useful in analyzing periodic and time-invariant phenomena, while wavelets are well
suited for the analysis of transient, time-varying signals. Since most real-life signals
are time varying in nature, the Wavelet Transform is well suited to many
applications[4].

1.4.1 Wavelets in Audio

DWT can be used to analyze temporal and spectral properties of non-stationary signals
such as audio. Unlike the Fourier transform, whose basis functions are sinusoids,
wavelet transforms are based on small waves, called wavelets, of varying frequency and
limited duration. A wavelet transform therefore reveals not only what notes (or
frequencies) to play but also when to play them. Conventional Fourier transforms, on the
other hand, provide only the notes or frequency information; temporal information is
lost in the transformation process.

Some audio applications where DWT could offer considerable improvement are the
extraction of beat attributes from music signals and the automatic classification of
non-speech audio signals using statistical pattern recognition. Shrinking transform
coefficients towards zero in the wavelet domain is one wavelet technique that removes
noise from a wide variety of signal types while preserving non-smooth features.

1.4.2 Wavelets in Video

Wavelet basis functions are obtained from a single mother wavelet by translation and
scaling. Also, the multi-resolution concept, satisfied by almost all useful wavelet
functions, makes it very useful in analyzing “real world” signals.

Multi-resolution theory is concerned with the representation and analysis of signals at
more than one resolution. The multi-resolution representation of video has the advantage
of scalability, i.e. the possibility of transmitting the same sequence at different
resolutions, as in high-resolution television, videophone and videoconferencing. DWT
offers better approximation at half the width and half as wide translation steps. This is
conceptually similar to improving frequency resolution by doubling the number of
harmonics in a Fourier series expansion.

While DCT-based image coders like JPEG perform very well at moderate bit

rates, at low bit rates the image quality degrades rapidly because of the blocking artifacts

introduced by the block based DCT transform. JPEG-2000 is an emerging standard in



image processing that uses DWT to achieve far superior image quality at very low bit

rates because of overlapping basis functions and better energy compaction property of

wavelet transformation.

1.4.3 Wavelets in Wireless applications

The analysis, design and measurement of antennas have been extremely important

in the development and success of wireless communication and applications.

Unfortunately, mathematical simulations of antennas are extremely complex and require
extensive computation and a large amount of memory. Using wavelets in conjunction with
other techniques in the numerical methods for solving the current distribution on the
antenna offers many advantages. The use of wavelets in such simulations reduces
computation, helps reduce errors and brings the results closer to the true values.

With the recent developments in wireless communication technologies, video streaming
and image compression techniques are very important for wireless applications that
transmit multimedia content over wireless channels. As wireless channels are very noisy
and have narrow bandwidth, higher compression is required for both image and video
signals. The wavelet transform could be a good choice of image compression technique in
wireless applications because of its advantage of providing better compression at higher
bit rates.

1.4.4 Wavelets in Neural Networks

Neural Networks (NN) have emerged as a powerful tool for data mining applications due to
their ability to learn patterns and relationships in complex, multi-dimensional data
sets. The effectiveness of any NN-based solution depends on a range of factors, such as
scalability of the network, generalization capability and dimensionality of the
parameter space, which often restrict the effectiveness of the NN. As such, any method
that is able to increase the quality or accessibility of the input data will be
invaluable. It is here that wavelets are likely to be extremely useful. NNs are useful in
conjunction with wavelets, with the latter serving as a preprocessing tool that
transforms hidden patterns into a more recognizable form suitable for use as a training
set.



CHAPTER 2

IMAGE COMPRESSION STANDARD



2.1 NEED FOR A COMPRESSION STANDARD

With the rapid developments of imaging technology, image compression and

coding tools and techniques, it is necessary to evolve coding standards so that there is

compatibility and interoperability between the image communication and storage

products manufactured by different vendors. Without the availability of standards,

encoders and decoders cannot communicate with each other. The most commonly used

standards are JPEG and JPEG2000[3].

2.2 JPEG

The aim of JPEG compression is to take full-color (and gray-scale) "real-world"

scenes and reduce the file size of images for storage and transmission.

While capacity and bandwidth have improved dramatically over the last decade,

the increased size of images makes JPEG still relevant for digital camera users and

websites.

This standard doesn't define exactly how to implement this process, but is

sufficiently wide that images from any program can be viewed. The most common

version in use is that produced by the Independent JPEG Group or IJG[3].

2.2.1 Need for JPEG

JPEG is needed to make image files smaller and to store 24-bit-per-pixel color data
instead of 8-bit-per-pixel data. An advantage of JPEG is that it stores full color
information: 24 bits/pixel.

2.2.2 JPEG STANDARD

In computing, JPEG (named after the Joint Photographic Experts Group who

created the standard) is a commonly used method of lossy compression for photographic

images.

The degree of compression can be adjusted, allowing a selectable trade off

between storage size and image quality.

JPEG (.jpeg, .jfif, .jpg and .jpe) is a standard image compression format developed

by and named after the Joint Photographic Experts Group.




It is one of the two most common formats for storing and sending images on the

Web. JPEG images are full-color images, meaning they are capable of storing 24 bits-per-

pixel and using 16 million colors[3].

JPEG is best for compressing full-color or gray-scale images, including

photographs and graphic images.

The JPEG format is unique in the aspect that images are compressed based on the

human eye. Because the human eye does not pick up subtle color distinctions and high

frequency brightness variations, data can be removed without completely changing the

image. However, as this data is removed the quality of the image decreases. This is the

reason JPEG compression is considered “lossy”.

Edges in a typical JPEG image - split by red, green and blue channels

Figure 2.1:Image describing JPEG standard.

As with all image compression formats, JPEG has both its advantages and disadvantages:

2.2.3 ADVANTAGES OF JPEG

Large compression ratios = shorter file transfer time

Full-color information

Great for photographs, graphic artwork, banner ads, etc

2.2.4 DISADVANTAGES OF JPEG

Loss of image quality

Sharp edges tend to come out blurry

Longer page load time than the GIF Format



JPEG uses a lossy compression algorithm so you will lose some detail when

converting other formats like BMP to a JPEG

If you have an illustrated image or a vector image, don't use JPEG because the

edges of lines may get blurred.

2.2.5 EMERGENCE OF JPEG 2000

JPEG 2000 addresses most of the following problems with JPEG:

The biggest problem is that JPEGs are lossy - when an image is converted to

JPEG, some of the information in the image is lost.

Professional photographers tend to avoid working repeatedly with JPEG images as

continually loading and saving the image causes the image to lose quality.

JPEGs don't support layers - most photo manipulation software use layers; to save

images as JPEGs the image has to be "flattened".

JPEGs only support 8 bit images. Modern digital cameras can operate in 12, 14 or

16 bit mode but if the images are saved as JPEGs, the extra information is

discarded

2.3 JPEG 2000

The JPEG-2000 image compression system has a rate-distortion advantage over

the original JPEG. JPEG-2000 is an emerging standard for still image

compression.

As digital imagery becomes more commonplace and of higher quality, there is the

need to manipulate more and more data.

Thus, image compression must not only reduce the necessary storage and

bandwidth requirements, but also allow extraction for editing, processing, and

targeting particular devices and applications.

More importantly, it also allows extraction of different resolutions, pixel

fidelities, and regions of interest, components, and more, all from a single

compressed bit stream.

This allows an application to manipulate or transmit only the essential information

for any target device from any JPEG 2000 compressed source image.



2.3.1 FEATURES OF JPEG-2000

State-of-the-art low bit-rate compression performance

Progressive transmission by quality, resolution, component, or spatial locality

Lossy and lossless compression (with lossless decompression available naturally through all types of progression)

Random (spatial) access to the bit stream

Pan and zoom (with decompression of only a subset of the compressed data)

Compressed domain processing (e.g., rotation and cropping)

Region of interest coding by progression

Limited memory implementations.

The aims of JPEG 2000 are not only improved compression performance over JPEG

but also added (or improved) features such as scalability and editability. Very low and

very high compression rates are supported in JPEG 2000.

In fact, the graceful ability of the design to handle a very large range of effective bit

rates is one of the strengths of JPEG 2000. While there is a modest increase in

compression performance of JPEG 2000 compared to JPEG, the main advantage offered

by JPEG 2000 is the significant flexibility of the code stream[3].

Figure 2.2 : COMPRESSION (ENCODING AND DECODING)

Conventional methods of lossless compression such as Zip reversibly reduce file sizes
while preserving information by compacting regularities in the data. JPEG compression
goes one step further, by organizing regularities in the visual perception of an image
and using lossy compression to reduce the file size of the image.

This process involves a small but irreversible loss of quality.




Figure 2.3: Edges in a typical image - zoomed in to see the pixels.

After compression most of the edges are still present, with some artifacts

The main steps are as follows (some require heavy mathematics):

Standard color space is 256 levels each of Red, Green and Blue (16.7 million RGB colors)

Color space separation (YCbCr) from RGB, e.g. Y (luminance) = 0.299 * R + 0.587 * G + 0.114 * B

Spatial separation into 8X8 pixel blocks

Sub-sampling (if required) of the chroma components Cb and Cr (colors) in 16X16 pixel blocks

Discrete Cosine Transform (DCT) of the spatial frequencies in each 8X8 block

Quantization of the spatial frequency matrix

Lossless compression of the resulting matrix

For illustrative purposes large images are not needed, since the entire JPEG

compression takes place inside 8X8 (or 16X16) pixel blocks. Note that a

JPEG cannot be compressed further using Zip or any other process of

lossless compression, since this is already done as the last step of the JPEG

encoding.

Note the predominance of green and blue pixels, with few red pixels

The green channel is closest to what the eye sees, with blue having next

most artifacts

Decoding an image from a JPEG is the reverse of this process, and does not

need elaboration here.
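The core of the steps listed above (luminance extraction, the 8X8 transform and quantization) can be sketched in Python with NumPy. This is an illustrative sketch rather than the JPEG reference implementation: the function names are hypothetical, a single uniform quantization step is assumed in place of the standard's quantization tables, and chroma sub-sampling and the final lossless coding stage are omitted.

```python
import numpy as np

def rgb_to_luma(rgb):
    """Luminance from RGB using the weights quoted in the text."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix C; C @ block @ C.T is the 2-D DCT of a block."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    m = np.arange(n).reshape(1, -1)   # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def encode_block(block, step):
    """Level-shift, transform and uniformly quantize one block (assumed scheme)."""
    c = dct_matrix(block.shape[0])
    coeffs = c @ (block - 128.0) @ c.T
    return np.round(coeffs / step).astype(int)

def decode_block(quantized, step):
    """Dequantize and apply the inverse transform."""
    c = dct_matrix(quantized.shape[0])
    return c.T @ (quantized * step) @ c + 128.0

STEP = 16.0                                        # assumed quantizer step size
ramp = np.tile(np.linspace(0.0, 255.0, 8), (8, 1)) # horizontal grey ramp
rgb_block = np.stack([ramp, ramp, ramp], axis=-1)  # 8 x 8 x 3 test block

luma = rgb_to_luma(rgb_block)
quantized = encode_block(luma, STEP)
restored = decode_block(quantized, STEP)

max_error = np.max(np.abs(restored - luma))
nonzero = np.count_nonzero(quantized)
print("non-zero coefficients:", nonzero, "of 64")
print("largest pixel error  :", max_error)
```

Most quantized coefficients of a smooth block are zero, which is what the final lossless coding stage exploits; a larger STEP gives more zeros at the cost of a larger reconstruction error.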



2.4 Implications

JPEG-2000 is unlikely to replace JPEG in low complexity applications at bit rates

in the range where JPEG performs well. However, for applications requiring

either higher quality or lower bitrates, or any of the features provided, JPEG-2000

should be a welcome standard.

JPEG-2000 provides better rate-distortion performance, for any given rate, than

the original JPEG standard. However, the largest improvements are observed at

very high and very low bitrates.

The improvements in the “near visually lossless” realm are more modest

(approximately 20%). Thus, widespread adoption of the new standard will likely

be based on the JPEG-2000 feature set.

While JPEG provided different methods of generating progressive bit streams,

with JPEG-2000 the progression is simply a matter of the order the compressed

bytes are stored in a file.



CHAPTER 3

DISCRETE WAVELET TRANSFORM



3.1 INTRODUCTION

The transform of a signal is just another form of representing the signal. It does

not change the information content present in the signal. The Wavelet Transform provides

a time-frequency representation of the signal. It was developed to overcome the

shortcomings of the Short Time Fourier Transform (STFT), which can also be used to analyze

non-stationary signals. While STFT gives a constant resolution at all frequencies, the

Wavelet Transform uses multi-resolution technique by which different frequencies are

analyzed with different resolutions. A wave is an oscillating function of time or space and

is periodic. In contrast, wavelets are localized waves. They have their energy

concentrated in time or space and are suited to analysis of transient signals. While Fourier

Transform and STFT use waves to analyze signals, the Wavelet Transform uses wavelets

of finite energy.

Figure 3.1: Demonstration of (a) a Wave and (b) a Wavelet

Wavelet analysis is performed in a similar way to STFT analysis. The signal to be
analyzed is multiplied with a wavelet function, just as it is multiplied with a window
function in STFT, and then the transform is computed for each segment generated[4].

However, unlike STFT, in the Wavelet Transform the width of the wavelet function changes
with each spectral component. At high frequencies, the Wavelet Transform gives good time
resolution and poor frequency resolution, while at low frequencies it gives good
frequency resolution and poor time resolution.



3.2 CONTINUOUS WAVELET TRANSFORM AND WAVELET SERIES

The Continuous Wavelet Transform (CWT) is provided by equation 3.1, where x(t) is the
signal to be analyzed, ψ(t) is the mother wavelet or the basis function, τ is the
translation parameter and s is the scale parameter. All the wavelet functions used in the
transformation are derived from the mother wavelet through translation (shifting) and
scaling (dilation or compression).

$$X_{WT}(\tau, s) = \frac{1}{\sqrt{|s|}} \int x(t)\, \psi^{*}\!\left(\frac{t - \tau}{s}\right) dt \qquad (3.1)$$

The mother wavelet used to generate all the basis functions is designed based on

some desired characteristics associated with that function. The translation parameter τ

relates to the location of the wavelet function as it is shifted through the signal. Thus, it

corresponds to the time information in the Wavelet Transform. The scale parameter s is

defined as |1/frequency| and corresponds to frequency information. Scaling either dilates

(expands) or compresses a signal. Large scales (low frequencies) dilate the signal and

provide detailed information hidden in the signal, while small scales (high frequencies)

compress the signal and provide global information about the signal. Notice that the

Wavelet Transform merely performs the convolution operation of the signal and the basis

function. The above analysis becomes very useful as in most practical applications, high

frequencies (low scales) do not last for a long duration, but instead, appear as short bursts,

while low frequencies (high scales) usually last for entire duration of the signal.

The Wavelet Series is obtained by discretizing CWT. This aids in computation of

CWT using computers and is obtained by sampling the time-scale plane. The sampling

rate can be changed accordingly with scale change without violating the Nyquist

criterion. The Nyquist criterion states that the minimum sampling rate that allows

reconstruction of the original signal is 2ω radians, where ω is the highest frequency in the

signal. Therefore, as the scale goes higher (lower frequencies), the sampling rate can be

decreased thus reducing the number of computations[4].



3.3 DWT

The Wavelet Series is just a sampled version of CWT and its computation may

consume significant amount of time and resources, depending on the resolution required.

The Discrete Wavelet Transform (DWT), which is based on sub-band coding is found to

yield a fast computation of Wavelet Transform. It is easy to implement and reduces the

computation time and resources required.

The foundations of DWT go back to 1976 when techniques to decompose discrete

time signals were devised. Similar work was done in speech signal coding which was

named as sub-band coding. In 1983, a technique similar to sub-band coding was

developed which was named pyramidal coding. Later many improvements were made to

these coding schemes which resulted in efficient multi-resolution analysis schemes.

In CWT, the signals are analyzed using a set of basis functions which relate to

each other by simple scaling and translation. In the case of DWT, a time-scale

representation of the digital signal is obtained using digital filtering techniques. The

signal to be analyzed is passed through filters with different cutoff frequencies at different

scales.

3.4 Filter Banks

3.4.1 Multi-Resolution Analysis using Filter Banks

Filters are one of the most widely used signal processing functions. Wavelets can

be realized by iteration of filters with rescaling. The resolution of the signal, which is a

measure of the amount of detail information in the signal, is determined by the filtering

operations, and the scale is determined by upsampling and downsampling (subsampling)

operations.

The DWT is computed by successive lowpass and highpass filtering of the discrete
time-domain signal as shown in Figure 3.2. This is called the Mallat algorithm or
Mallat-tree decomposition. Its significance is in the manner it connects the
continuous-time multiresolution to discrete-time filters. In the figure, the signal is
denoted by the sequence x[n], where n is an integer. The low pass filter is denoted by G0
while the high pass filter is denoted by H0. At each level, the high pass filter produces
detail information, d[n], while the low pass filter associated with the scaling function
produces coarse approximations, a[n].

Figure 3.2: Three level decomposition tree

At each decomposition level, the half band filters produce signals spanning only half
the frequency band. This doubles the frequency resolution as the uncertainty in
frequency is reduced by half. In accordance with Nyquist's rule, if the original signal
has a highest frequency of ω, which requires a sampling frequency of 2ω radians, then
after filtering it has a highest frequency of ω/2 radians. It can now be sampled at a
frequency of ω radians, thus discarding half the samples with no loss of information.
This decimation by 2 halves the time resolution as the entire signal is now represented
by only half the number of samples. Thus, while the half band low pass filtering removes
half of the frequencies and thus halves the resolution, the decimation by 2 doubles the
scale.

With this approach, the time resolution becomes arbitrarily good at high

frequencies, while the frequency resolution becomes arbitrarily good at low frequencies.

The time-frequency plane is thus resolved as shown in figure 1.1(d) of Chapter 1. The

filtering and decimation process is continued until the desired level is reached. The

maximum number of levels depends on the length of the signal. The DWT of the original

signal is then obtained by concatenating all the coefficients, a[n] and d[n], starting from

the last level of decomposition.
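The filtering and decimation procedure described above can be sketched in Python with NumPy. The sketch assumes the two-tap Haar filters for G0 and H0 (the text does not fix a particular filter at this point), a signal whose length is a power of two, and hypothetical function names.

```python
import numpy as np

# Assumed analysis filters: two-tap Haar pair
G0 = np.array([1.0, 1.0]) / np.sqrt(2.0)    # low pass, gives approximations a[n]
H0 = np.array([1.0, -1.0]) / np.sqrt(2.0)   # high pass, gives details d[n]

def analysis_step(x):
    """One decomposition level: filter, then keep every second output sample."""
    a = np.convolve(x, G0)[1::2]
    d = np.convolve(x, H0)[1::2]
    return a, d

def dwt(x, levels):
    """Mallat-tree decomposition; returns [a_L, d_L, ..., d_1] concatenated."""
    a = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        a, d = analysis_step(a)   # only the approximation is decomposed further
        details.append(d)
    return np.concatenate([a] + details[::-1])

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
coeffs = dwt(x, levels=3)
print(coeffs)
```

For an eight-sample input, three levels leave one approximation coefficient followed by one, two and four detail coefficients, so the transform has the same length as the signal. Because the Haar filters are orthonormal, the energy of the coefficients equals the energy of the signal.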



Figure 3.3: Three level reconstruction tree

Figure 3.3 shows the reconstruction of the original signal from the wavelet
coefficients. Basically, the reconstruction is the reverse process of decomposition.
The approximation and detail coefficients at every level are up-sampled by two, passed
through the low pass and high pass synthesis filters and then added. This process is
continued through the same number of levels as in the decomposition process to obtain
the original signal. The Mallat algorithm works equally well if the analysis filters, G0
and H0, are exchanged with the synthesis filters, G1 and H1.
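The reconstruction tree of Figure 3.3 can be sketched as a round trip in Python with NumPy. As an assumption for illustration, the Haar filters are used for both the analysis and synthesis banks and the signal length is a power of two; the function names are hypothetical.

```python
import numpy as np

S = np.sqrt(2.0)
G0, H0 = np.array([1.0, 1.0]) / S, np.array([1.0, -1.0]) / S   # analysis pair
G1, H1 = np.array([1.0, 1.0]) / S, np.array([-1.0, 1.0]) / S   # synthesis pair

def analyze(x):
    """Filter with G0 and H0, then decimate by two."""
    return np.convolve(x, G0)[1::2], np.convolve(x, H0)[1::2]

def upsample(c):
    """Insert a zero after every coefficient."""
    up = np.zeros(2 * len(c))
    up[::2] = c
    return up

def synthesize(a, d):
    """Up-sample both branches, filter with G1 and H1, and add the results."""
    n = 2 * len(a)
    return (np.convolve(upsample(a), G1) + np.convolve(upsample(d), H1))[:n]

def decompose(x, levels):
    a, details = np.asarray(x, dtype=float), []
    for _ in range(levels):
        a, d = analyze(a)
        details.append(d)
    return a, details[::-1]        # details ordered coarsest first

def reconstruct(a, details):
    for d in details:              # coarsest level is undone first
        a = synthesize(a, d)
    return a

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a3, details = decompose(x, levels=3)
x_rec = reconstruct(a3, details)
print(np.allclose(x_rec, x))
```

The reconstruction passes through the same number of levels as the decomposition, in reverse order, and returns the original samples up to floating-point rounding.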

3.4.2 Conditions for Perfect Reconstruction

In most Wavelet Transform applications, it is required that the original signal be
synthesized from the wavelet coefficients. To achieve perfect reconstruction the
analysis and synthesis filters have to satisfy certain conditions. Let G0(z) and G1(z) be
the low pass analysis and synthesis filters, respectively, and H0(z) and H1(z) the high
pass analysis and synthesis filters, respectively. Then the filters have to satisfy the
following two conditions, given in equations 3.2 and 3.3:

$$G_0(-z)\,G_1(z) + H_0(-z)\,H_1(z) = 0 \qquad (3.2)$$

$$G_0(z)\,G_1(z) + H_0(z)\,H_1(z) = 2z^{-d} \qquad (3.3)$$

The first condition implies that the reconstruction is aliasing-free and the second

condition implies that the amplitude distortion has amplitude of one. It can be observed

that the perfect reconstruction condition does not change if we switch the analysis and

synthesis filters.



There are a number of filters which satisfy these conditions. But not all of them give

accurate Wavelet Transforms, especially when the filter coefficients are quantized. The

accuracy of the Wavelet Transform can be determined after reconstruction by calculating

the Signal to Noise Ratio (SNR) of the signal. Some applications like pattern recognition

do not need reconstruction, and in such applications, the above conditions need not apply.
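The two perfect reconstruction conditions can be checked numerically by treating each filter as a polynomial in z^-1, so that a product of transfer functions becomes a convolution of coefficient arrays. The sketch below assumes the Haar analysis pair and the common alias-cancelling choice G1(z) = H0(-z), H1(z) = -G0(-z); neither choice is prescribed by the text.

```python
import numpy as np

def negate_z(p):
    """Coefficients of P(-z), given those of P(z) as a polynomial in z^-1."""
    return p * (-1.0) ** np.arange(len(p))

s = np.sqrt(2.0)
g0 = np.array([1.0, 1.0]) / s     # assumed low pass analysis filter G0(z)
h0 = np.array([1.0, -1.0]) / s    # assumed high pass analysis filter H0(z)
g1 = negate_z(h0)                 # assumed synthesis choice G1(z) = H0(-z)
h1 = -negate_z(g0)                # assumed synthesis choice H1(z) = -G0(-z)

# Alias term: G0(-z) G1(z) + H0(-z) H1(z), which must vanish
alias = np.convolve(negate_z(g0), g1) + np.convolve(negate_z(h0), h1)

# Distortion term: G0(z) G1(z) + H0(z) H1(z), which must equal 2 z^-d
distortion = np.convolve(g0, g1) + np.convolve(h0, h1)

print("alias term     :", alias)
print("distortion term:", distortion)
```

For this filter pair the alias term is identically zero and the distortion term has a single coefficient of 2 at z^-1, that is, a pure delay with d = 1.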

3.4.3 Classification of Wavelets

We can classify wavelets into two classes: (a) orthogonal and (b) biorthogonal. Based on

the application, either of them can be used.

(a)Features of orthogonal wavelet filter banks

The coefficients of orthogonal filters are real numbers. The filters are of the same
length and are not symmetric. The low pass filter, G0, and the high pass filter, H0, are
related to each other by

$$H_0(z) = z^{-N}\, G_0(-z^{-1}) \qquad (3.4)$$

The two filters are alternating flips of each other. The alternating flip automatically
gives double-shift orthogonality between the lowpass and highpass filters, i.e., the
scalar product of the filters for a shift by two is zero: Σ G[k] H[k-2l] = 0, where
k, l ∈ Z. Filters that satisfy equation 3.4 are known as Conjugate Mirror Filters (CMF).
Perfect reconstruction is possible with alternating flip.

Also, for perfect reconstruction, the synthesis filters are identical to the analysis

filters except for a time reversal. Orthogonal filters offer a high number of vanishing

moments. This property is useful in many signal and image processing applications. They

have regular structure which leads to easy implementation and scalable architecture.
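The alternating flip relation H0(z) = z^-N G0(-z^-1) and the resulting double-shift orthogonality can be verified numerically. The sketch below assumes the four-tap Daubechies low pass filter as G0 (the text does not name a specific filter here); the helper name is hypothetical.

```python
import numpy as np

# Assumed example: four-tap Daubechies low pass filter
r3 = np.sqrt(3.0)
g0 = np.array([1 + r3, 3 + r3, 3 - r3, 1 - r3]) / (4 * np.sqrt(2.0))
N = len(g0) - 1                      # filter order, odd (N = 3)

# Alternating flip in the time domain: h0[n] = (-1)^(N-n) * g0[N-n]
n = np.arange(N + 1)
h0 = (-1.0) ** (N - n) * g0[N - n]

def shifted_dot(a, b, l):
    """sum_k a[k] * b[k - 2l], with samples outside the filters taken as zero."""
    total = 0.0
    for k in range(len(a)):
        j = k - 2 * l
        if 0 <= j < len(b):
            total += a[k] * b[j]
    return total

for l in (-1, 0, 1):
    print(f"l = {l:+d}: sum G[k] H[k-2l] = {shifted_dot(g0, h0, l):.2e}")
print("unit norm of G0    :", shifted_dot(g0, g0, 0))
print("G0 shifted by two  :", shifted_dot(g0, g0, 1))
```

Every cross product between the low pass and high pass filters is zero for all even shifts, and the low pass filter is also orthogonal to its own double shifts, which is the property that allows the synthesis filters to be time-reversed copies of the analysis filters.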

(b)Features of biorthogonal wavelet filter banks

In the case of the biorthogonal wavelet filters, the low pass and the high pass

filters do not have the same length. The low pass filter is always symmetric, while the

high pass filter could be either symmetric or anti-symmetric. The coefficients of the filters

are either real numbers or integers.

For perfect reconstruction, a biorthogonal filter bank has all odd length or all even

length filters. The two analysis filters can be symmetric with odd length or one symmetric

and the other antisymmetric with even length. Also, the two sets of analysis and synthesis



filters must be dual. The linear phase biorthogonal filters are the most popular filters for

data compression applications.

3.5 Wavelet Families

There are a number of basis functions that can be used as the mother wavelet for

Wavelet Transformation. Since the mother wavelet produces all wavelet functions used in

the transformation through translation and scaling, it determines the characteristics of the

resulting Wavelet Transform. Therefore, the details of the particular application should be

taken into account and the appropriate mother wavelet should be chosen in order to use

the Wavelet Transform effectively[4].

Figure 3.4: Wavelet families: (a) Haar, (b) Daubechies4, (c) Coiflet1, (d) Symlet2, (e) Meyer, (f) Morlet, (g) Mexican Hat.



CHAPTER 4

Overview of DWT Algorithm and DA for DWT



4.1 DWT of an image

A low pass filter and a high pass filter are chosen such that they exactly halve the

frequency range between themselves. This filter pair is called the analysis filter pair. First

the low pass filter is applied for each row of data, thereby getting the low frequency

components of the row. But since the low pass filter is a half band filter, the output data

contains frequencies only in the first half of the original frequency range. So they can be

sub sampled by two, so that the output data now contains only half the original number of

samples. Now the high pass filter is applied for the same row of data, and similarly the

high pass components are separated and placed by the side of the low pass components.

This procedure is done for all rows. Next, the filtering is done for each column of

the intermediate data. The resulting two dimensional array of coefficients contains four

bands of data, each labeled as LL (Low-Low), HL (High-Low), LH (Low-High) and HH

(High-High). The LL band can be decomposed once again in the same manner, thereby

producing even more sub bands. This can be done up to any level, thereby resulting in a

pyramidal decomposition as shown.

The LL band at the highest level can be classified as most important and the other

detail bands can be classified as of lesser importance, with the degree of importance

decreasing from the top of the pyramid to the bands at the bottom[5].
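The row-column procedure described above can be sketched in a few lines. This is a minimal illustration and not the report's implementation: it assumes the unnormalized Haar averaging/differencing pair as the analysis filters and a periodic extension at the borders, and the function names `analyze_1d` and `dwt2_one_level` are hypothetical.

```python
def analyze_1d(x, lo, hi):
    """Filter x with lo and hi (periodic extension) and keep every second output.
    The low pass half is placed first, with the high pass half beside it."""
    n = len(x)
    low = [sum(lo[k] * x[(2 * i + k) % n] for k in range(len(lo)))
           for i in range(n // 2)]
    high = [sum(hi[k] * x[(2 * i + k) % n] for k in range(len(hi)))
            for i in range(n // 2)]
    return low + high

def dwt2_one_level(img, lo, hi):
    """One decomposition level: filter every row, then every column."""
    rows = [analyze_1d(r, lo, hi) for r in img]
    cols = [analyze_1d(list(c), lo, hi) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major order

# Assumed analysis pair: Haar average and difference
lo = [0.5, 0.5]
hi = [0.5, -0.5]

img = [[10, 10, 20, 20],
       [10, 10, 20, 20],
       [30, 30, 40, 40],
       [30, 30, 40, 40]]

out = dwt2_one_level(img, lo, hi)
ll = [row[:2] for row in out[:2]]  # top-left quadrant is the LL band
for row in out:
    print(row)
```

For this blocky test image all the detail bands are zero and the LL band holds the 2x2 image of block averages, which illustrates why the LL band is treated as the most important one.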



Figure 4.1: Image encoding.

4.2 INVERSE DWT OF AN IMAGE

Just as a forward transform is used to separate the image data into various classes of

importance, a reverse transform is used to reassemble the various classes of data into a

reconstructed image. A pair of high pass and low pass filters is used here also. This filter

pair is called the synthesis filter pair. The filtering procedure is just the opposite. We start

from the topmost level, apply the filters column wise first and then row wise and proceed

to the next level, till we reach the first level.

In this section the theoretical background and algorithm development is discussed.

The first recorded mention of what is now called a "wavelet" seems to be in 1909, in a

thesis by Alfred Haar. An image is represented as a two dimensional (2D) array of

coefficients, each coefficient representing the brightness level at that point[5].

When looking from a higher perspective, it is not possible to differentiate between

coefficients as more important ones, and lesser important ones. But thinking more

intuitively, it is possible. Most natural images have smooth color variations, with the fine

details being represented as sharp edges in between the smooth variations.

Technically, the smooth variations in color can be termed as low frequency

components and the sharp variations as high frequency components. The low frequency

components (smooth variations) constitute the base of an image, and the high frequency



components (the edges which give the detail) add upon them to refine the image, thereby

giving a detailed image. Hence the averages (smooth variations) carry more
importance than the details.

In wavelet analysis, a signal can be separated into approximations (averages)
and details. Averages are the high-scale, low frequency components of the
signal. The details are the low-scale, high frequency components. If we pass a real
digital signal through both the low pass and the high pass filter, we wind up with twice
as much data as we started with. That is why downsampling has to be done after filtering.

The inverse process is how those components can be assembled back into the

original signal without loss of information. This process is called reconstruction or

synthesis. The mathematical manipulation that effects synthesis is called the inverse
discrete wavelet transform. The original signal is reconstructed from the wavelet
coefficients. Where wavelet analysis involves filtering and downsampling, the wavelet
reconstruction process consists of upsampling and filtering. The DWT algorithm consists
of Forward DWT (FDWT) and Inverse DWT (IDWT), which are shown in Figures 4.2 and 4.3
respectively.



Figure 4.2: Two dimensional decomposition.

Figure 4.3: Two Dimensional IDWT



The FDWT can be performed on a signal using different types of filters such as
db7, db4 or Haar. The forward transform can be done in two ways: the matrix
multiplication method and linear equations. In the FDWT, each step calculates a set of
wavelet averages (approximation or smooth values) and a set of details. If a data set
s0, s1, ..., sN-1 contains N elements, there will be N/2 averages and N/2 detail values.
The averages are stored in the upper half and the details are stored in the lower half of
the N element array.
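One forward step and its inverse can be sketched as follows. This is an illustration under stated assumptions: the Haar filter in its simple average/difference form is used, "upper half" is read as the first N/2 positions of the array, and the function names are hypothetical.

```python
def haar_forward_step(s):
    """One FDWT step: N/2 averages go to the first half, N/2 details to the second half."""
    n = len(s)          # assumed to be even
    half = n // 2
    out = [0.0] * n
    for i in range(half):
        a, b = s[2 * i], s[2 * i + 1]
        out[i] = (a + b) / 2.0          # average (smooth value)
        out[half + i] = (a - b) / 2.0   # detail
    return out

def haar_inverse_step(c):
    """One IDWT step: rebuild each sample pair from its average and detail."""
    n = len(c)
    half = n // 2
    s = [0.0] * n
    for i in range(half):
        avg, det = c[i], c[half + i]
        s[2 * i] = avg + det
        s[2 * i + 1] = avg - det
    return s

data = [9, 7, 3, 5, 6, 10, 2, 6]
coeffs = haar_forward_step(data)
restored = haar_inverse_step(coeffs)
print(coeffs)    # averages first, then details
print(restored)  # matches the input data
```

The array keeps its length N after the step, so the transform can be applied again to the first half to obtain the next decomposition level.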

4.3 DISTRIBUTED ARITHMETIC FOR DWT

With the rapid progress of VLSI design technologies, many processors based on
audio and image signal processing have been developed recently. The two-dimensional
discrete wavelet transform (2D DWT) plays a major role in image/video compression
standards such as JPEG2000 and MPEG4. Wavelets decompose a signal into an approximation
at one level and detail signals at the next level, so that subsequent levels add more
detail to the information content. Presently, research on the DWT is attracting a great
deal of attention. In addition to audio and image compression, the DWT has important
applications in many areas, such as computer graphics, numerical analysis, radar target
distinguishing and so forth. The architecture of the 2D DWT is mainly composed of
multi-rate filters. Because extensive computation is involved in practical applications,
e.g., digital cameras, high-efficiency and low-cost hardware is indispensable. These
applications require real-time manipulation of digital images. Because of this, fast
algorithms and specific circuits for the DWT have been developed.

Among the methods for two-dimensional DWT, the indirect method based on
row-column decomposition is the best adapted to a hardware implementation. Distributed
arithmetic (DA) was proposed about two decades ago and has since been used widely in VLSI
implementations of DSP architectures. Most of these applications are computation
intensive, with multiplication and/or addition being the predominant operation. The main
advantage of the distributed arithmetic approach is that it speeds up the multiply process
by pre-computing all the possible intermediate values and storing these values in a ROM.
The input data can then be used to directly address the memory and read out the result.

In this section, we only consider the separable 2-D DWT. We propose an
efficient 2D DWT architecture based on distributed arithmetic. The proposed architecture
uses only RAM instead of ROM, because the size of the ROM grows exponentially
when the number of inputs and the internal precision increase. Distributed arithmetic and
row-column decomposition reduce the amount of hardware and enhance the speed performance.

The basic architecture deals with the separable 2D DWT, whose mathematical
formulas are defined as follows. In the decomposition, the wavelet coefficients of any
stage can be calculated from the DWT of the previous stage. The following expressions show
how the k-th scaling and wavelet coefficients Xh(n,j+1) and Xg(n,j+1) are obtained at the
(j+1)-th stage.

$$X_h(n, j+1) = \sum_{m} X_h(m, j)\, h(m - 2k) \qquad (4.1)$$

$$X_g(n, j+1) = \sum_{m} X_g(m, j)\, g(m - 2k) \qquad (4.2)$$

Figure 4.4 shows a classical one-level implementation of the analysis and synthesis of the
DWT system using a filter bank structure. The input signal x(n) is filtered in the analysis
process using the low pass filter h and the high pass filter g. The symbols ↑2 and ↓2 denote
upsampling and downsampling by a factor of two; the downsampler decimates the filter
results. The synthesis process is the dual of the analysis process[5].

Figure 4.4: One level implementation using filter bank

To derive the distributed arithmetic architecture for DWT, consider the following sum of products:

$$y = \sum_{k=1}^{L} A_k X_k \qquad (4.3)$$

where Ak are the fixed coefficients of the filter bank and Xk are the input samples. The
decomposed expression of (4.3) in the form of DA can be written as equation (2):



Note that in equation (2), A is the distributed arithmetic matrix of fixed coefficient bits Aki,
where k = 1, 2, ..., L and i = 0, 1, ..., N-1, with Ak,N-1 the MSB and Ak,0 the LSB. It
should be noted again that, in the distributed arithmetic architecture for DWT, the bits of
the coefficients are distributed, unlike conventional DA, where the bits of the input data
words are distributed. Furthermore, the DA matrix contains only 0s and 1s, which
means the computation of y can be carried out just by shifting and adding the input
vectors.

Matrix A is very important to the DA architecture of DWT since its structure can lead
to savings in the hardware needed to implement the computations. It consists only of 0s
and 1s. Therefore, we refer to matrix A as the Adder Butterflies. Overall, by using the DA
architecture of DWT, the inner product of vectors in (4.3) can be implemented generally
with basic adder cells.
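The idea that the coefficient bits select which inputs are added can be sketched in software. The example below is a simplified model under stated assumptions: the coefficients are unsigned integers (the report's real filter coefficients are signed fractions), and the function name `da_inner_product` and the test values are hypothetical.

```python
def da_inner_product(coeffs, samples, nbits):
    """Compute sum(coeffs[k] * samples[k]) using only additions and shifts.
    Assumes unsigned integer coefficients smaller than 2**nbits."""
    # A[k][i] is bit i of coefficient k: the 0/1 matrix of distributed coefficient bits
    A = [[(c >> i) & 1 for i in range(nbits)] for c in coeffs]
    y = 0
    for i in range(nbits):
        # Adder stage: add the inputs whose coefficient has bit i set
        partial = 0
        for k, x in enumerate(samples):
            if A[k][i]:
                partial += x
        # Shift stage: weight the partial sum by 2**i
        y += partial << i
    return y

coeffs = [5, 4, 3, 2]    # illustrative coefficients
samples = [1, 2, 3, 4]   # illustrative input samples
y = da_inner_product(coeffs, samples, nbits=3)
direct = sum(c * x for c, x in zip(coeffs, samples))
print(y, direct)
```

No multiplier appears in `da_inner_product`; each coefficient bit only decides whether an input takes part in the addition for that bit weight.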

Consider the four high pass filter coefficients

[2 3 4 2]

and the image samples

[X0, X1, X2, ..., X7]

The first image sample X0 enters the filter and the sum of products (SOP) output Y0 is

Y0 = 2X0 + 3X-1 + 4X-2 + 2X-3

where the samples before X0 are zero. Now X1 enters and Y1 is

Y1 = 3X0 + 2X1

Similarly,

Y2 = 4X0 + 3X1 + 2X2

Y3 = 2X0 + 4X1 + 3X2 + 2X3

Y4 = 2X1 + 4X2 + 3X3 + 2X4

Y5 = 2X2 + 4X3 + 3X4 + 2X5

Y6=……….

Y7=……….



Now, for easy computation while configuring and realizing the distributed arithmetic
architecture, let us take the filter coefficients

H = [2 3 4 5]

and the input samples

X = [1 2 3 4 5 6 7 8]

The computation is done as shown below

[2 3 4 5]

8 7 6 5 4 3 2 1

8 7 6 5 4 3 2 1

8 7 6 5 4 3 2 1

8 7 6 5 4 3 2 1

……………

…………….

Y0=2*1

Y1=3*1+2*2

Y2=4*1+3*2+2*3

Y3=5*1+4*2+3*3+2*4

Y4=5*2+4*3+3*4+2*5

Y5=……..

…..

…..
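The outputs listed above can be reproduced with a short script. It is only a sketch of the sum of products y[n] = sum over k of h[k]*x[n-k], with the samples before the first input taken as zero; the function name `fir_outputs` is hypothetical.

```python
def fir_outputs(h, x):
    """y[n] = sum over k of h[k] * x[n - k], with samples before x[0] taken as zero."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

H = [2, 3, 4, 5]
X = [1, 2, 3, 4, 5, 6, 7, 8]
Y = fir_outputs(H, X)
for n, value in enumerate(Y):
    print("Y%d = %d" % (n, value))
```

The first outputs are Y0 = 2, Y1 = 7, Y2 = 16, Y3 = 30 and Y4 = 44, in agreement with the expressions written out above.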

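Now Y3 can be rewritten with each input sample expressed in 3-bit binary form:

$$Y_3 = 5\,[0\;0\;1] + 4\,[0\;1\;0] + 3\,[0\;1\;1] + 2\,[1\;0\;0]$$

$$Y_3 = 5\,(0\cdot 2^2 + 0\cdot 2^1 + 1\cdot 2^0) + 4\,(0\cdot 2^2 + 1\cdot 2^1 + 0\cdot 2^0) + 3\,(0\cdot 2^2 + 1\cdot 2^1 + 1\cdot 2^0) + 2\,(1\cdot 2^2 + 0\cdot 2^1 + 0\cdot 2^0)$$

Collecting the terms by bit weight gives

$$Y_3 = (0\cdot 5 + 0\cdot 4 + 0\cdot 3 + 1\cdot 2)\,2^2 + (0\cdot 5 + 1\cdot 4 + 1\cdot 3 + 0\cdot 2)\,2^1 + (1\cdot 5 + 0\cdot 4 + 1\cdot 3 + 0\cdot 2)\,2^0 = 8 + 14 + 8 = 30$$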

Similarly, the input samples can be extended to a fourth bit, in contrast with the earlier
example, in which we used 3 bits for each sample. In other words, each input sample is
then represented by 4 bits.



Let us consider another example to demonstrate the form of the above equation for
efficient realization, i.e.,

H = [2 3 4 5]

X = [9 7 5 8]

Here the samples in binary are 9 = [1 0 0 1], 7 = [0 1 1 1], 5 = [0 1 0 1] and
8 = [1 0 0 0], and the output Y = 5*9 + 4*7 + 3*5 + 2*8 = 104 is represented as

$$Y = (1\cdot 5 + 0\cdot 4 + 0\cdot 3 + 1\cdot 2)\,2^3 + (0\cdot 5 + 1\cdot 4 + 1\cdot 3 + 0\cdot 2)\,2^2 + (0\cdot 5 + 1\cdot 4 + 0\cdot 3 + 0\cdot 2)\,2^1 + (1\cdot 5 + 1\cdot 4 + 1\cdot 3 + 0\cdot 2)\,2^0$$

Now we can realize that a total of 2^4 (or 16) combinations of the coefficients can be
stored in the ROM. Having developed the simplified representation of the sum of products
(SOP) equation Y, we move further to design the rough (prototype) architecture of the DA.
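The 16-entry ROM can be modelled directly. The sketch below is an illustration under stated assumptions: unsigned integer coefficients and samples, the coefficients 5, 4, 3, 2 paired with the samples 9, 7, 5, 8 as in the example, and hypothetical function names `build_rom` and `da_lut`.

```python
def build_rom(coeffs):
    """ROM[addr] = sum of the coefficients whose bit is set in addr (2**L entries)."""
    L = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << L)]

def da_lut(coeffs, samples, nbits):
    """Sum of products where bit b of every sample forms the ROM address."""
    rom = build_rom(coeffs)
    y = 0
    for b in range(nbits):
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k   # one address bit per input sample
        y += rom[addr] << b               # weight the looked-up sum by 2**b
    return y

coeffs = [5, 4, 3, 2]
samples = [9, 7, 5, 8]
rom = build_rom(coeffs)
y = da_lut(coeffs, samples, nbits=4)
print(len(rom), y)
```

With four coefficients the ROM holds 2^4 = 16 precomputed sums, and the four lookups (one per bit position) add up to 104, the same value as the direct sum of products.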

It consists of SISO (serial-in serial-out) registers and a ROM, where the number of
SISO registers depends upon the filters employed for the particular application.

One bit of data is serially fed into the SISO register on each clock pulse and a
shifting operation (either left or right shift) is performed. At the end of the
operation, a 1-bit output is serially fed out of the register.

The ROM contains the mappable coefficients. In other words, the LSBs (least
significant bits) of all the input samples are mapped onto the ROM for the
corresponding coefficients. If the LSBs match the ROM contents altogether, then
the corresponding coefficient is given as output.



Figure 4.5: Mapping of the serial outputs onto the ROM coefficients

The above prototype operates as follows:

It takes 3 clock cycles to load a single SISO.

At the 4th clock, 1 bit of SISO0 is right-shifted into SISO1.

Therefore, a total of 3*4 = 12 clock cycles is needed to load the shifters.

The next 3 clocks are needed to map the LSBs of the shifters onto the ROM and
generate one output, i.e., to compute the first output by parallel mapping of the serial
outputs.

So a total of 21 cycles are required to generate the first 3 outputs.

Another input sample enters at SISO0 for the next 3 clocks, and the contents of SISO3 are
replaced by the contents of SISO2.

The distributed arithmetic architecture is incomplete without the section discussed below:

The output of the ROM is given to the ADDER.

The ADDER input is summed with the ACCUMULATOR contents. The Accumulator
is initialized to zero at first.

The output of the Adder is right-shifted and stored in the Accumulator.

The prototype along with the Adder, Accumulator and Shifter forms the complete Distributed
Arithmetic Architecture. This is represented diagrammatically as shown below
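The interplay of the shift registers, ROM, adder, accumulator and shifter can be modelled bit-serially. This is a behavioural sketch and not the report's hardware description: it assumes unsigned integers, and it pre-scales the ROM output by 2^nbits so that the right shift never discards bits (an assumption of this model, standing in for a sufficiently wide accumulator). The function name `da_bit_serial` is hypothetical.

```python
def da_bit_serial(coeffs, samples, nbits):
    """Bit-serial DA: LSBs of the shift registers address the ROM, the ROM output
    is added to the accumulator, and the sum is right-shifted by one each cycle."""
    L = len(coeffs)
    rom = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << L)]
    regs = list(samples)   # behavioural model of the SISO registers
    acc = 0                # accumulator is initialized to zero
    for _ in range(nbits):
        addr = 0
        for k in range(L):
            addr |= (regs[k] & 1) << k   # serial output bit (LSB) of each register
            regs[k] >>= 1                # shift the register right by one bit
        # Adder, then right shift; the ROM word is pre-scaled by 2**nbits (assumption)
        acc = (acc + (rom[addr] << nbits)) >> 1
    return acc

coeffs = [5, 4, 3, 2]
samples = [9, 7, 5, 8]
y = da_bit_serial(coeffs, samples, nbits=4)
print(y)
```

After nbits cycles the accumulator holds the complete sum of products; for the values above the accumulator passes through 96, 80 and 96 before settling at 104.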



Figure 4.6: General Distributive Arithmetic Architecture



CHAPTER 5

IMAGE COMPRESSION



5.1 PROBLEM STATEMENT

Distributed Arithmetic Architecture can be used for 9/7 tap filters in the 2-dimensional
discrete wavelet transform. The 9-tap high-pass filter with the DA
architecture has the following salient features:

It has 9 SISOs, each of 8 bits.

The first 8*9 = 72 cycles are for loading all SISOs.

8 cycles for generating the first output.

Next 8 cycles to load the first SISO.

Next 8 cycles to compute.

Total = 8+8+8 = 24 cycles are required to compute the first 3 outputs.

The first output is fed to the Adder, where it is summed with the accumulator
contents, i.e., zero.

The output is right-shifted and fed to the Accumulator.

And the cycle continues.

The 7-tap low-pass filter with the DA architecture has the following salient features:

It has 7 SISOs, each of 8 bits.

The first 8*7 = 56 cycles are for loading all SISOs.

5 cycles for generating the first output.

Next 5 cycles to load the first SISO.

Next 5 cycles to compute.

Total = 5+5+5 = 15 cycles are required to compute the first 3 outputs.

The first output is fed to the Adder, where it is summed with the accumulator
contents, i.e., zero.

The output is right-shifted and fed to the Accumulator.

And the cycle continues.



[Block diagram: nine 8-bit SISO registers (9-SISO), ROM memory map (2^9), adder, accumulator, shifter]

Figure 5.1: 9-tap high pass filter with DA-architecture

[Block diagram: seven 8-bit SISO registers (7-SISO), ROM memory map]

Figure 5.2: 7-tap lowpass filter with DA-architecture

5.2 PROPOSED ARCHITECTURE

The architecture is based on the popular Daubechies 9/7 filter bank (floating point)
used in JPEG2000 and MPEG4. The floating-point 9/7 forward transform uses two
analysis filters, h (high-pass) and g (low-pass). Without loss of generality we assume
accuracy up to 5 decimal places; the resulting coefficients are listed below. The finite
precision of the hardware limits the accurate representation of floating-point numbers;
hence for the purpose of implementation we represent the coefficients with an accuracy of
13 bits. The assumption is reasonable, as a 13-bit representation gives high enough
accuracy for the fixed-point implementation.

[Figure 5.2, continued: ROM (2^7), adder, accumulator, shifter]



The 9/7 tap high and low pass FIR filters are as follows:

Y(2i+1) = (-0.045636)*[X(2i-2)+X(2i+4)]

+(0.028772)*[X(2i-1)+X(2i+3)]

+(0.295636)*[X(2i)+X(2i+2)]

+(-0.557543)*X(2i+1);

Y(2i) = (0.026749)*[X(2i-4)+X(2i+4)]

+(-0.016864)*[X(2i-3)+X(2i+3)]

+(-0.078223)*[X(2i-2)+X(2i+2)]

+(0.266864)*[X(2i-1)+X(2i+1)]

+(0.602949)*X(2i);

So the coefficient matrices are as follows:

h = [(-0.045636) (0.028772) (0.295636) (-0.557543)];

g = [(0.026749) (-0.016864) (-0.078223) (0.266864) (0.602949)];
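The effect of a 13-bit word length on these coefficients can be estimated with a small script. This is a sketch under stated assumptions: the bit weights run from 2^-12 to 2^0, and negative coefficients are handled as sign plus magnitude, which the report does not specify. The names `to_bits` and `from_bits` are hypothetical.

```python
H_COEFFS = [-0.045636, 0.028772, 0.295636, -0.557543]
G_COEFFS = [0.026749, -0.016864, -0.078223, 0.266864, 0.602949]

WORD = 13          # coefficient word length in bits
FRAC = WORD - 1    # bit weights run from 2**-12 up to 2**0

def to_bits(c):
    """Quantize |c| to 13 bits; bits[i] has weight 2**(i - 12).
    Sign-magnitude handling is an assumption of this sketch."""
    code = round(abs(c) * (1 << FRAC))
    bits = [(code >> i) & 1 for i in range(WORD)]
    sign = -1 if c < 0 else 1
    return sign, bits

def from_bits(sign, bits):
    """Rebuild the quantized coefficient from its sign and bit vector."""
    return sign * sum(b * 2.0 ** (i - FRAC) for i, b in enumerate(bits))

errors = []
for c in H_COEFFS + G_COEFFS:
    sign, bits = to_bits(c)
    q = from_bits(sign, bits)
    errors.append(abs(c - q))
    # Print the bits MSB first next to the quantized value
    print("%9.6f -> %s -> %9.6f" % (c, "".join(str(b) for b in reversed(bits)), q))

max_err = max(errors)
print("largest quantization error:", max_err)
```

With rounding to the nearest code, the error of each coefficient stays within half of the smallest bit weight, i.e. 2^-13.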

Then the coefficient matrices (9/7 tap high and low pass FIR filters) can be distributed into
13 bits (the coefficient word length), so h and g can also be written as[5]:

$$h = [\,2^{-12}\;\; 2^{-11}\;\; \ldots\;\; 2^{-1}\;\; 2^{0}\,]\, A_h \qquad (5.1)$$

$$g = [\,2^{-12}\;\; 2^{-11}\;\; \ldots\;\; 2^{-1}\;\; 2^{0}\,]\, A_g \qquad (5.2)$$

A_h and A_g are represented as follows:



CHAPTER 6

SOFTWARE REFERENCE MODEL



6.1 MATLAB

6.1.1 OVERVIEW OF MATLAB

MATLAB is a high-performance language for technical computing. It integrates

computation, visualization, and programming in an easy-to-use environment where

problems and solutions are expressed in familiar mathematical notation. Typical uses

include:

Math and computation

Algorithm development

Data acquisition

Modeling, simulation, and prototyping

Data analysis, exploration, and visualization

Scientific and engineering graphics

Application development, including graphical user interface building

MATLAB is an interactive system whose basic data element is an array that does

not require dimensioning. This allows you to solve many technical computing problems,

especially those with matrix and vector formulations, in a fraction of the time it would

take to write a program in a scalar noninteractive language such as C or FORTRAN.

The name MATLAB stands for matrix laboratory. MATLAB was originally

written to provide easy access to matrix software developed by the LINPACK and

EISPACK projects. Today, MATLAB engines incorporate the LAPACK and BLAS

libraries, embedding the state of the art in software for matrix computation.

MATLAB has evolved over a period of years with input from many users. In

university environments, it is the standard instructional tool for introductory and

advanced courses in mathematics, engineering, and science. In industry, MATLAB is the

tool of choice for high-productivity research, development, and analysis[6].



6.2 MATLAB SYSTEM

The MATLAB system consists of these main parts:

6.2.1 DESKTOP TOOLS AND DEVELOPMENT ENVIRONMENT

This is the set of tools and facilities that help you use MATLAB functions and

files. Many of these tools are graphical user interfaces. It includes the MATLAB desktop

and Command Window, a command history, an editor and debugger, a code analyzer and

other reports, and browsers for viewing help, the workspace, files, and the search path.

6.2.2 MATLAB MATHEMATICAL FUNCTION LIBRARY

This is a vast collection of computational algorithms ranging from elementary

functions, like sum, sine, cosine, and complex arithmetic, to more sophisticated functions

like matrix inverse, matrix eigenvalues, Bessel functions, and fast Fourier transforms.

6.2.3 MATLAB LANGUAGE

This is a high-level matrix/array language with control flow statements, functions,

data structures, input/output, and object-oriented programming features. It allows both

"programming in the small" to rapidly create quick and dirty throw-away programs, and

"programming in the large" to create large and complex application programs.

6.2.4 GRAPHICS

MATLAB has extensive facilities for displaying vectors and matrices as graphs,

as well as annotating and printing these graphs. It includes high-level functions for two-

dimensional and three-dimensional data visualization, image processing, animation, and

presentation graphics. It also includes low-level functions that allow you to fully

customize the appearance of graphics as well as to build complete graphical user

interfaces on your MATLAB applications.



6.2.5 MATLAB EXTERNAL INTERFACES

This is a library that allows you to write C and FORTRAN programs that interact

with MATLAB. It includes facilities for calling routines from MATLAB (dynamic

linking), calling MATLAB as a computational engine, and for reading and writing MAT-

files.

6.3 IMAGE PROCESSING TOOLBOX

6.3.1 INTRODUCTION

Image Processing Toolbox is a collection of functions that extend the capability of

the MATLAB numeric computing environment. The toolbox supports a wide range of

image processing operations, including

Spatial image transformations

Morphological operations

Neighborhood and block operations

Linear filtering and filter design

Transforms

Image analysis and enhancement

Image registration

Region of interest operations

Many of the toolbox functions are MATLAB M-files, a series of MATLAB

statements that implement specialized image processing algorithms. We can view the

MATLAB code for these functions using the statement

type function_name

We can extend the capabilities of Image Processing Toolbox by writing our own

M-files, or by using the toolbox in combination with other toolboxes, such as Signal

Processing Toolbox and Wavelet Toolbox.



6.3.2 READ AND DISPLAY AN IMAGE

First, clear the MATLAB workspace of any variables and close open figure windows.

close all

To read an image, use the imread command. The example reads one of the sample

images included with Image Processing Toolbox, pout.tif, and stores it in an array named I.

I = imread('pout.tif');

Now display the image. The toolbox includes two image display functions:

imshow and imtool. imshow is the toolbox's fundamental image display function. imtool

starts the Image Tool which presents an integrated environment for displaying images and

performing some common image processing tasks. The Image Tool provides all the

image display capabilities of imshow but also provides access to several other tools for

navigating and exploring images, such as scroll bars, the Pixel Region tool, Image

Information tool, and the Contrast Adjustment tool.

6.3.3 IMAGE APPEARANCE IN THE WORKSPACE

To see how the imread function stores the image data in the workspace, check the

Workspace browser in the MATLAB desktop. The Workspace browser displays

information about all the variables you create during a MATLAB session. The imread

function returned the image data in the variable I, which is a 291-by-240 element array of

uint8 data. MATLAB can store images as uint8, uint16, or double arrays.

6.3.4 IMPROVING IMAGE CONTRAST

pout.tif is a somewhat low contrast image. To see the distribution of intensities in

pout.tif, we can create a histogram by calling the imhist function.

figure, imhist(I)

The intensity range is rather narrow. It does not cover the potential range of

[0, 255], and is missing the high and low values that would result in good contrast. The

toolbox provides several ways to improve the contrast in an image.

One way is to call the histeq function to spread the intensity values over the full

range of the image, a process called histogram equalization.

I2 = histeq(I);

Display the new equalized image, I2, in a new figure window.

figure, imshow(I2)



6.4 PSNR AND MSE FOR IMAGES

6.4.1 PSNR

Compute peak signal-to-noise ratio (PSNR) between images. The PSNR block

computes the peak signal-to-noise ratio, in decibels, between two images[6]. This ratio is

often used as a quality measurement between the original and a compressed image. The

higher the PSNR, the better the quality of the compressed image[1].

6.4.2 MSE

In statistics, the mean square error or MSE of an estimator is one of many ways to

quantify the difference between an estimator and the true value of the quantity being

estimated.

MSE is a risk function, corresponding to the expected value of the squared error

loss or quadratic loss. MSE measures the average of the square of the "error." The error is

the amount by which the estimator differs from the quantity to be estimated.
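For two images of the same size, both measures can be computed as in the following sketch. It assumes 8-bit grayscale images stored as nested lists and a peak value of 255; the function names `mse` and `psnr` and the pixel values are illustrative only.

```python
import math

def mse(original, reconstructed):
    """Mean of the squared pixel differences between two equally sized images."""
    total = 0
    count = 0
    for row_a, row_b in zip(original, reconstructed):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count

def psnr(original, reconstructed, peak=255):
    """PSNR in decibels: 10 * log10(peak^2 / MSE); infinite for identical images."""
    m = mse(original, reconstructed)
    if m == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / m)

original = [[52, 55], [61, 59]]
reconstructed = [[54, 55], [60, 59]]

error = mse(original, reconstructed)
quality = psnr(original, reconstructed)
print("MSE  =", error)
print("PSNR = %.2f dB" % quality)
```

A smaller MSE gives a larger PSNR, which is why a higher PSNR indicates a better reconstruction.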




CHAPTER 7:

FPGA IMPLEMENTATION



7.1 FPGA basic design Flow Overview:

The ISE design flow comprises the following steps: design entry, design

synthesis, design implementation, and Xilinx device programming. Design verification,

which includes both functional verification and timing verification, takes place at

different points during the design flow. This section describes what to do during each

step.

Figure 7.1: FPGA Basic Design Flow

7.2 Design Summary:

Design entry is the first step in the ISE design flow. During design entry, you

create your source files based on your design objectives. You can create your top-level

design file using a Hardware Description Language (HDL), such as VHDL, Verilog, or

ABEL, or using a schematic. You specify your top-level module type when you create

your project as described in Creating a Project[9].



You can use multiple formats for the lower-level source files in your design. Different

source types are available, depending on your project properties (top-level module type,

device type, synthesis tool, and language). You can create these source files in Project

Navigator, as described in Creating a Source File. Some source types launch additional

tools to help you create the file, as described in Source File Types.

Table 7.1: Design Summary

image_inte Project Status
Project File: image_inte.ise | Current State: Programming File Generated
Module Name: video | Errors: No Errors
Target Device: xc2vp30-7ff896 | Warnings: 703 Warnings (676 new, 0 filtered)
Product Version: ISE 10.1 - WebPACK | Routing Results: All Signals Completely Routed
Design Goal: Balanced | Timing Constraints: All Constraints Met
Design Strategy: Xilinx Default (unlocked) | Final Timing Score: 0 (Timing Report)

image_inte Partition Summary
No partition information was found.

Device Utilization Summary
Logic Utilization | Used | Available | Utilization | Note(s)
Number of Slice Flip Flops | 113 | 27,392 | 1% |
Number of 4 input LUTs | 333 | 27,392 | 1% |
Logic Distribution
Number of occupied Slices | 203 | 13,696 | 1% |
Number of Slices containing only related logic | 203 | 203 | 100% |
Number of Slices containing unrelated logic | 0 | 203 | 0% |
Total Number of 4 input LUTs | 378 | 27,392 | 1% |
Number used as logic | 333 | | |
Number used as a route-thru | 45 | | |
Number of bonded IOBs | 31 | 556 | 5% |
Number of RAMB16s | 15 | 136 | 11% |
Number of BUFGMUXs | 2 | 16 | 12% |
Number of DCMs | 1 | 8 | 12% |

Performance Summary
Final Timing Score: 0 | Pinout Data: Pinout Report
Routing Results: All Signals Completely Routed | Clock Data: Clock Report
Timing Constraints: All Constraints Met

Detailed Reports
Report Name | Status | Generated | Errors | Warnings | Infos
Synthesis Report | Current | Wed 9. Jun 00:02:30 2010 | 0 | 676 Warnings (676 new, 0 filtered) | 25 Infos (24 new, 0 filtered)
Translation Report | Current | Wed 9. Jun 00:03:10 2010 | 0 | … filtered) | 0
Map Report | Current | Wed 9. Jun 00:03:52 2010 | 0 | … filtered) | 3 Infos (0 new, 0 filtered)
Place and Route Report | Current | Wed 9. Jun 00:05:20 2010 | 0 | 1 Warning (0 new, 0 filtered) | 2 Infos (0 new, 0 filtered)
Static Timing Report | Current | Wed 9. Jun 00:05:50 2010 | 0 | 0 | 3 Infos (0 new, 0 filtered)
Bitgen Report | Current | Wed 9. Jun 00:06:38 2010 | 0 | 0 | 2 Infos (0 new, 0 filtered)

Table 7.1(Contd): Design Summary

7.3 Timing Constraints:

The ISE software allows you to enter timing constraints that describe the timing
performance requirements of the design. Providing a concise set of constraints achieves
the following:

Allows the software to create a design that meets your requirements.

Allows you to compare the constraints to the performance of the resulting
design, using the timing reports output by the ISE software. By analyzing the
timing reports, you can identify the paths in the design that may require
coding modifications, placement directives, or additional constraints to
achieve timing closure.

Increases the performance of the ISE software by
reducing the memory and runtime requirements[9].

Timing Constraints
Met | Constraint | Check | Worst Case Slack | Best Case Achievable | Timing Errors | Timing Score
Yes | Autotimespec constraint for clock net dwt1/dw_2d/d1/s1 | SETUP / HOLD | N/A / 0.701ns | 3.018ns / N/A | 0 / 0 | 0
… | net vga_out_pixel_clock_OBUF | SETUP / HOLD | N/A / 0.562ns | 23.268ns / N/A | 0 / 0 | 0
… | net dwt1/dw_2d/clkd3 | SETUP / HOLD | N/A / 0.635ns | 1.863ns / N/A | 0 / 0 | 0
… | … | SETUP / HOLD | N/A / 0.701ns | 2.949ns / N/A | 0 / 0 | 0
… | net dwt1/dw_2d/d2/s | SETUP / HOLD | N/A / 0.721ns | 3.035ns / N/A | 0 / 0 | 0
… | … | SETUP / HOLD | N/A / 0.712ns | 3.138ns / N/A | 0 / 0 | 0
… | net dwt1/dw_2d/d3/s | SETUP / HOLD | N/A / 0.713ns | 3.297ns / N/A | 0 / 0 | 0
… | net dwt1/dw_2d/d1/s | SETUP / HOLD | N/A / 0.855ns | 3.445ns / N/A | 0 / 0 | 0

Table 7.2: Timing Constraints

7.4 Clock Report

This report contains information on the resource utilization of each clock region

and lists any clock conflicts between global clock buffers in a clock region.

Clock Report
Clock Net | Resource | Locked | Fanout | Net Skew(ns) | Max Delay(ns)
vga_out_pixel_clock_OBUF | BUFGMUX0P | No | 443 | 0.233 | 1.212
dwt1/dw_2d/clkd3 | BUFGMUX4P | No | 50 | 0.024 | 1.122
dwt1/dw_2d/d2/s | BUFGMUX6P | No | 36 | 0.020 | 1.006
dwt1/dw_2d/d1/s1 | Local | | 63 | 0.038 | 2.192
… | … | | 62 | 0.145 | 2.480
… | … | | 62 | 0.046 | 2.239

Table 7.3: Clock Report



7.5 Synthesis Report:

After design entry and optional simulation, you run synthesis. In the Sources tab, select

Synthesis/Implementation from the Design View drop-down list, and select the top

module. In the Processes tab, double-click Synthesize.

The ISE software includes Xilinx Synthesis Technology (XST), which synthesizes

VHDL, Verilog, or mixed language designs to create Xilinx-specific netlist files known

as NGC files. Unlike output from other vendors, which consists of an EDIF file with an

associated NCF file, NGC files contain both logical design data and constraints. XST

places the NGC file in your project directory and the file is accepted as input to the

Translate (NGDBuild) step of the Implement Design process. To specify XST as your

synthesis tool, you must set the Synthesis Tool Project Property to XST, as described in

Changing Project, Source, and Snapshot Properties[9].

Table 7.4: Synthesis Report

---- Source Parameters

Input File Name : "video.prj"

Input Format : mixed

Ignore Synthesis Constraint File : NO

---- Target Parameters

Output File Name : "video"

Output Format : NGC

Target Device : xc2vp30-7-ff896

---- Source Options

Top Module Name : video

Automatic FSM Extraction : YES

FSM Encoding Algorithm : Auto

Safe Implementation : No

FSM Style : lut

RAM Extraction : Yes

RAM Style : Auto

ROM Extraction : Yes

Mux Style : Auto

Decoder Extraction : YES

Priority Encoder Extraction : YES

Shift Register Extraction : YES

Logical Shifter Extraction : YES

XOR Collapsing : YES



ROM Style : Auto

Mux Extraction : YES

Resource Sharing : YES

Asynchronous To Synchronous : NO

Multiplier Style : auto

Automatic Register Balancing : No

---- Target Options

Add IO Buffers : YES

Global Maximum Fanout : 500

Add Generic Clock Buffer(BUFG) : 16 :16

Register Duplication : YES

Slice Packing : YES

Optimize Instantiated Primitives : NO

Convert Tristates To Logic : Yes

Use Clock Enable : Yes

Use Synchronous Set : Yes

Use Synchronous Reset : Yes

Pack IO Registers into IOBs : auto

Equivalent register Removal : YES

---- General Options

Optimization Goal : Speed

Optimization Effort : 1

Library Search Order : video.lso

Keep Hierarchy : NO

Netlist Hierarchy : as_optimized

RTL Output : Yes

Global Optimization : AllClockNets

Read Cores : YES

Write Timing Constraints : NO

Cross Clock Analysis : NO

Hierarchy Separator : /

Bus Delimiter : <>

Case Specifier : maintain

Slice Utilization Ratio : 100

BRAM Utilization Ratio : 100

Verilog 2001 : YES

Auto BRAM Packing : NO

Slice Utilization Ratio Delta : 5

Table 7.4 (Contd.): Synthesis Report



7.6 RTL Schematic:

The synthesized design can be viewed as a schematic in the register transfer level

(RTL) viewer. This view displays gates and elements independently of the targeted Xilinx

device.

Figure 7.2 : RTL Schematic

The schematic shows a representation of the pre-optimized design in terms of generic

symbols, such as adders, multipliers, counters, AND gates, and OR gates, which are

independent of the targeted Xilinx device. Viewing this schematic may help you discover

design issues early in the design process. [9]

Figure 7.3: Pictorial view of RTL schematic



Figure 7.4: Technology Schematic Overview

The synthesized design can be viewed as a schematic in a technology schematic viewer.

This view displays gates and elements as they will appear on the Xilinx device.

Figure 7.5: Technology Schematic

7.7 Implement Design:

Translate:

The Translate process merges all of the input net-lists and design constraints and outputs

a Xilinx native generic database (NGD) file, which describes the logical design reduced

to Xilinx primitives. See the following table for details. [9]



Table 7.5: Translate Process

NGDBUILD Design Results Summary:

Number of errors : 0

Number of warnings : 25

Total memory usage is 102260 kilobytes

7.7.1 Floor plan design after Translate

The general steps in the basic flow are as follows:

The design is created, synthesized, and transformed into an NGD file. The NGD file includes location constraints that originated in your design source, a UCF, or an NCF. The file may also include references or instances of IP macros. Floorplan Editor reads the NGD file, reads the design hierarchy, pulls in data for any IP macros, and creates a representation of your design. While reading the NGD file, Floorplan Editor interprets any I/O standards applied to buffers connected to I/Os and displays them in the Design Objects tab window. Floorplan Editor modifies one or more UCFs [9].

Note: Floorplan Editor does not create the UCF. If you don't already have one, you must first create at least one UCF using the Project Navigator New Source or Add Source functions. The UCFs are then input to NGDBuild and the remainder of the Xilinx implementation flow is completed. When the initial constraints are from your design source or an NCF, these constraints cannot be removed when a UCF is used as Floorplan Editor output. They can only be overridden by constraints applied in Floorplan Editor and finally saved in a UCF.

Translate Process

Command line tool : NGDBuild
Tcl command : process run "Translate"
Input files : EDIF, SEDIF, EDN, EDF, NGC, UCF, NCF, URF, NMC, BMM
Output files : BLD (report), NGD
Process properties : Translate Properties
Tools available after running process : Constraints Editor, Floorplan Editor, Floorplanner, PACE
Note : Each of these tools modifies the UCF file. When you rerun Translate with the updated UCF, the NGD file is updated.

7.8 Map Report:

The Map process maps the logic defined by an NGD file into FPGA elements, such as

CLBs and IOBs. The output design is a native circuit description (NCD) file that

physically represents the design mapped to the components in the Xilinx FPGA. See the

following table for details. [9]

Map Process

Command line tool : MAP
Tcl command : process run "Map"
Input files : NGD, NMC, NCD, NGM
Note : The NCD and NGM files are for guiding.
Output files : NCD, PCF, NGM, MRP (report), GRF, MAP, PSR
Process properties : Map Properties
Tools available after running process : Floorplanner, FPGA Editor, Timing Analyzer

Table 7.6: Map Process

Table 7.7: Map Report (below)

Target Device : xc2vp30

Target Package : ff896

Target Speed : -7

Design Summary

Number of errors : 0

Number of warnings : 2

Logic Utilization:

Number of Slice Flip Flops : 113 out of 27,392 1%

Number of 4 input LUTs : 339 out of 27,392 1%



Logic Distribution:

Number of occupied Slices : 200 out of 13,696 1%

Number of Slices containing only related logic : 200 out of 200 100%

Number of Slices containing unrelated logic : 0 out of 200 0%

Total Number of 4 input LUTs : 371 out of 27,392 1%

Number used as logic : 339

Number used as a route-thru : 32

Number of bonded IOBs : 31 out of 556 5%

Number of RAMB16s : 15 out of 136 11%

Number of BUFGMUXs : 2 out of 16 12%

Number of DCMs : 1 out of 8 12%

Peak Memory Usage : 231 MB

Total REAL time to MAP completion : 11 secs

Total CPU time to MAP completion : 8 secs

Table 7.7 (Contd.): Map Report
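The utilization percentages in the Map report can be cross-checked with a few lines of Python. The sketch below copies the used and available counts from the report. The rounding rule in `reported_percent` (truncate, but show at least 1% for any nonzero use) is an assumption inferred from the numbers in the table, not a documented behaviour of the MAP tool.

```python
# Resource counts copied from the Map report: (used, available).
MAP_REPORT = {
    "Slice Flip Flops": (113, 27392),
    "Total 4 input LUTs": (371, 27392),
    "Occupied Slices": (200, 13696),
    "Bonded IOBs": (31, 556),
    "RAMB16s": (15, 136),
    "BUFGMUXs": (2, 16),
    "DCMs": (1, 8),
}


def utilization_percent(used: int, available: int) -> float:
    """Exact utilization as a percentage of the available resources."""
    return 100.0 * used / available


def reported_percent(used: int, available: int) -> int:
    """Whole-number figure as it appears in the report.

    Assumed rule: truncate the exact value, but never show less than 1%
    for a resource that is used at all.
    """
    if used == 0:
        return 0
    return max(1, int(utilization_percent(used, available)))


for name, (used, available) in MAP_REPORT.items():
    exact = utilization_percent(used, available)
    shown = reported_percent(used, available)
    print(f"{name:20s} {used:6d} / {available:6d} = {exact:6.2f}% (shown as {shown}%)")
```

The exact figures make it clear that the design occupies well under 2% of the slice logic, while the block RAMs (about 11%) are the most heavily used resource.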

7.9 Place and Route:

The Place and Route process takes a mapped NCD file, places and routes the design,

and produces an NCD file that is used as input for bitstream generation.

Place and Route Process

Command line tool : PAR
Tcl command : process run "Place & Route"
Input files : NCD, PCF
Note : In addition to the NCD file from MAP, PAR also accepts an NCD file for guiding.
Output files : NCD, PAR (report), PAD, CSV, TXT, GRF, DLY
Process properties : Place & Route Properties
Tools available after running process : Floorplanner, FPGA Editor, Timing Analyzer, TRACE, XPower Analyzer

Table 7.8: Place and Route Process
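The hand-off of files between the Synthesize, Translate, Map, and Place and Route processes can be summarized with a small Python model. This is only an illustrative sketch: the `Stage` class and `handoff` function are hypothetical names, and the file-type lists are copied from the process tables in this chapter (the synthesis inputs are limited to VHDL and Verilog sources). It does not invoke any Xilinx tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str            # process name as shown in Project Navigator
    tool: str            # command line tool listed in the process table
    inputs: frozenset    # file types the stage accepts
    outputs: frozenset   # file types the stage produces


FLOW = [
    Stage("Synthesize", "XST",
          frozenset({"VHDL", "Verilog"}),
          frozenset({"NGC"})),
    Stage("Translate", "NGDBuild",
          frozenset({"EDIF", "SEDIF", "EDN", "EDF", "NGC", "UCF", "NCF",
                     "URF", "NMC", "BMM"}),
          frozenset({"BLD", "NGD"})),
    Stage("Map", "MAP",
          frozenset({"NGD", "NMC", "NCD", "NGM"}),
          frozenset({"NCD", "PCF", "NGM", "MRP", "GRF", "MAP", "PSR"})),
    Stage("Place & Route", "PAR",
          frozenset({"NCD", "PCF"}),
          frozenset({"NCD", "PAR", "PAD", "CSV", "TXT", "GRF", "DLY"})),
]


def handoff(flow):
    """For each pair of consecutive stages, list the file types passed on."""
    return [(a.name, b.name, sorted(a.outputs & b.inputs))
            for a, b in zip(flow, flow[1:])]


for src, dst, files in handoff(FLOW):
    print(f"{src} -> {dst}: {', '.join(files)}")
```

Under these tables, the NGC netlist feeds Translate, the NGD database feeds Map, and the mapped NCD together with the PCF constraints feeds Place and Route.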



Device Utilization Summary:

Number of BUFGMUXs 2 out of 16 12%

Number of DCMs 1 out of 8 12%

Number of External IOBs 31 out of 556 5%

Number of LOCed IOBs 31 out of 31 100%

Number of RAMB16s 15 out of 136 11%

Number of SLICEs 200 out of 13696 1%

Overall effort level (-ol) Standard

Placer effort level (-pl) High

Placer cost table entry (-t) 1

Router effort level (-rl) Standard

REAL time consumed by placer 24 secs

CPU time consumed by placer 21 secs

Table 7.9: Place and Route



Figure 7.6: View of the design after place and route [9]

Data in XPower Analyzer

Table 7.10: XPower Analyzer [9]



7.10 Configure target device:

Target Device Properties

The following properties are available for the Configure Target Device process for a

CPLD or FPGA device.

iMPACT Project File

The iMPACT Project File (IPF) contains information from a previous session of iMPACT. If you specify an IPF file in this property and run the Configure Target Device process, the target device is configured according to the settings in the specified IPF file. If Default is specified here, the target device is configured according to the settings in the default IPF file, <ISE_image_inte>.ipf.

Port to be used (Advanced)

Specifies the port to use for configuration; here we use USB. Auto-default causes the software to search every port for a connection, automatically detect an available cable, and connect to it.

Run Generate Target PROM/ACE File

If selected, the Configure Target Device process automatically runs the Generate Target PROM/ACE File process to generate a PROM or ACE file before configuring the target device. The file is generated using the information from the .ipf file specified in the iMPACT Project File property. When Automatically Generate Target PROM/ACE File is set to True (checkbox is checked), the PROM or ACE file is generated in the background before the target device is configured. This is useful for quick PROM or System ACE file regeneration when a bitstream has changed. [9]



Figure 7.7: Output Simulation Window

Figure 7.8: Snapshot1 of Image Compression Chip(internal view 1)



Figure 7.9: Image Compression Chip (internal view 2)

Figure 7.10: Image Compression Chip Internal View 3



RESULT:

[a] Original image [b] Reconstructed image

The original image and the reconstructed image are compared in terms of PSNR (dB) and MSE. The two images are observed to be similar to each other, which validates our result.
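The two comparison metrics can be computed as in the following Python sketch. It assumes 8-bit grayscale images stored as lists of rows, so the peak value is 255. The 3 x 3 pixel values are made-up illustration data, not results from this project.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equally sized grayscale images."""
    total = 0
    count = 0
    for row_o, row_r in zip(original, reconstructed):
        for p, q in zip(row_o, row_r):
            total += (p - q) ** 2
            count += 1
    return total / count


def psnr_db(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


# Hypothetical 3 x 3 image patches, for illustration only.
original = [[52, 55, 61], [59, 79, 61], [76, 62, 59]]
reconstructed = [[52, 54, 61], [60, 79, 61], [76, 62, 57]]

print("MSE :", mse(original, reconstructed))
print("PSNR:", psnr_db(original, reconstructed), "dB")
```

A lower MSE and a higher PSNR both indicate that the reconstructed image is closer to the original.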



CHAPTER 8

Conclusion and Scope for Future Work



8.1 Conclusion

An image compression algorithm was simulated using Matlab to comprehend the process of image compression. Modifying the padding style reduced the error, because it gives a better reproduction of the image at its edges. It also supports faithful reproduction of the image while keeping the size of the transform coefficient matrix equal to the image size. For the VLSI implementation of an image compression encoder, Verilog HDL was chosen.

The theoretical benefits of DA are that it realizes the full potential of the FPGA architecture for hardware implementation and achieves a high degree of parallelism. The relative area and speed efficiencies of DA turn out to be good when it is implemented on an FPGA. The DA approach can achieve close to the maximum clock rate possible with a given FPGA technology using only basic 4-LUT based blocks and the fast ripple carry chains, whereas the multi-stage modulo adders required in an RNS implementation are slow, even for small word lengths, so the accumulator stage becomes the performance bottleneck.

It has also been observed that large adders implemented in FPGAs with fast carry chains are quite fast, and that the adder delay scales less than linearly with increasing word length. In light of the implementation results, DA based architectures have an area, speed and simplicity advantage over the other implementation methods considered. In this context, we can say that DA implementations are superior when targeting FPGAs.

8.2 Scope for Future work

The newly developed concept of 'sparsity' in signal processing can be used in the context of image compression. The first step of the scheme is to apply a sparsifying transform to the image. The sparse set of coefficients is then encoded via Sparse PCA. The wavelet transform has been used extensively for image compression tasks, but it is not the ideal choice. The partial reconstruction error from wavelet coefficients is an order of magnitude higher than the ideal error rate for many critical applications. Image compression can instead be carried out in the curvelet domain, a better choice than wavelets, at least theoretically, since the reconstruction error rate with curvelet coefficients is of the same asymptotic order as the ideal error rate.



APPENDIX-A

FPGA ARCHITECTURE



A Field Programmable Gate Array (FPGA) is a semiconductor device containing programmable logic components and programmable interconnects. The programmable logic components can be programmed to duplicate the functionality of basic logic gates such as AND, OR, XOR, NOT or more complex combinational functions such as decoders or simple mathematical functions. In most FPGAs these programmable logic components also include memory elements, which may be simple flip-flops or more complete blocks of memories. FPGAs are generally slower than their Application Specific Integrated Circuit (ASIC) counterparts, cannot handle as complex a design, and draw more power. [7]

The programmable logic devices are capable of implementing a sequential network but not a complete digital system. Programmable gate arrays (PGAs) and complex programmable logic devices (CPLDs) are more flexible and more versatile and can be used to implement a complete digital system on a chip. Some of the largest devices can implement a small microprocessor.

A typical PGA is an IC that contains an array of identical logic cells with

programmable interconnections. We can program the functions realized by each logic cell

and connections between the cells. Such PGAs are called FPGAs since they are field

programmable.[7]



A.1 APPLICATION OF FPGA

Figure A.1: Multiply accumulate operation: (a) conventional implementation, (b) distributed arithmetic implementation. [7]



A.2 Virtex-II Pro

One of the most advanced FPGA families in industry is the FPGA series produced by Xilinx. The Virtex user programmable gate array comprises two major configurable elements: configurable logic blocks (CLBs) and input/output blocks (IOBs). Each CLB is composed of two slices, as shown in Figure A.2. A slice contains two 4-input, 1-output LUTs and two registers. Interconnections between these elements are configured by multiplexers controlled by SRAM cells programmed by a user's bitstream. The LUTs allow any function of five inputs, two functions of four inputs, or some functions of up to nine inputs to be created within a CLB slice. This structure allows a very powerful method of implementing arbitrary, complex digital logic.

Figure A.2: Simplified Architecture of Virtex configurable logic block.

Virtex FPGAs are programmed using Verilog HDL, a popular hardware description language. The language can describe the behavioral nature of a design, the data flow of a design, a design's structural composition, delays and a waveform generation mechanism. Models written in this language can be verified using a Verilog simulator. As a programming and development environment, Xilinx ISE Foundation Series tools have been used to produce a physical implementation for the Virtex FPGA. Field programmable gate arrays (FPGAs) provide a new implementation platform for the discrete wavelet transform.

FPGAs maintain the advantages of the custom functionality of VLSI ASIC

devices, while avoiding the high development costs and the inability to make design

modifications after production. Furthermore, FPGAs inherit design flexibility and

adaptability of software implementations.

We make maximal utilization of the lookup table (LUT) architecture of Virtex

FPGAs by reformulating the wavelet transform computation in accordance with the

distributed arithmetic algorithm. Distributed arithmetic makes extensive use of look-up

tables, which makes it ideal for implementing the discrete wavelet transform functions

onto the LUT-based architecture of Virtex FPGAs. Moreover, distributed arithmetic is

suitable for low power portable applications because it allows replacement of costly

multipliers with shifts and look-up tables. Indeed, one of the unique features of our

discrete wavelet transform implementation is exploiting the natural match between the

Virtex architecture and distributed arithmetic.

Three more unique features are worth mentioning at this point.

The first is the flexibility of the implementation, made possible by the re-programmability of FPGAs, which allows easy modification of the wavelet type.

The second is that, unlike most reported implementations, which concentrate on architecture development, this implementation goes down to the actual implementation level.

Finally, this work describes implementations for both the forward and inverse transforms.
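The way distributed arithmetic replaces multipliers with look-up tables and shifts can be illustrated with a short Python model. This is a behavioural sketch under stated assumptions, not the Verilog design of this project: the coefficients are taken to be integers (real wavelet filter coefficients would first be scaled to fixed point), the samples are signed values in `width`-bit two's complement, and the function names are hypothetical.

```python
def build_da_lut(coeffs):
    """Precompute every partial sum of the coefficients.

    Entry `addr` holds the sum of coeffs[k] for each bit k set in `addr`,
    so a K-tap inner product needs a table with 2**K entries
    (16 entries for K = 4, the depth of one 4-input LUT).
    """
    k = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << k)]


def da_inner_product(coeffs, samples, width=8):
    """Compute sum(coeffs[k] * samples[k]) bit-serially, without multiplying."""
    lut = build_da_lut(coeffs)
    mask = (1 << width) - 1
    words = [s & mask for s in samples]      # two's complement bit patterns
    acc = 0
    for bit in range(width):
        # Gather bit `bit` of every sample into one table address.
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> bit) & 1) << i
        partial = lut[addr] << bit           # a shift replaces the multiplier
        # The most significant bit has negative weight in two's complement.
        acc += -partial if bit == width - 1 else partial
    return acc


coeffs = [3, -1, 4, 2]          # hypothetical integer filter taps
samples = [10, -20, 7, -128]    # hypothetical 8-bit signed input samples
print(da_inner_product(coeffs, samples))
print(sum(c * s for c, s in zip(coeffs, samples)))   # reference result
```

Both print statements give the same value, which shows that the table look-up, shift, and accumulate sequence is equivalent to the conventional multiply accumulate operation.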

A.3 INTERNAL CONFIGURATION

The basic Virtex logic element in a CLB is the slice. Two slices are present in each CLB as shown in Figure 2.6. Each slice contains two 4-input, 1-output LUTs and two registers. Interconnections between these elements are configured by multiplexers controlled by SRAM cells programmed by a user's bitstream. The LUTs allow any function of five inputs, two functions of four inputs, or some functions of up to nine inputs to be created within a CLB slice. The outputs of these functions may be registered, or the registers may be used independently of the LUTs. This structure allows a very powerful method of implementing arbitrary, complex digital logic.

Figure A.3. Simplified Virtex configurable slice

A.3.1 LOOK-UP TABLE IMPLEMENTATION

Virtex slices have the ability to implement distributed memory instead of logic. Each 4-input LUT in a slice may be used to implement a 16x1 ROM or RAM, or the two LUTs may be combined together to create a 32x1 ROM or RAM or a 16x1 dual-port RAM. This allows each slice to trade logic resources for memory in order to maximize the resources available for a particular application.
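The use of a 4-input LUT as a 16x1 ROM, and of two LUTs as a 32x1 ROM, can be modelled in Python as follows. This is a behavioural sketch only: the function names are hypothetical, and storing the ROM contents as an integer whose bit `addr` is the value at address `addr` is an assumed convention chosen for the example.

```python
def lut4_read(contents: int, addr: int) -> int:
    """Read one bit from a 4-input LUT used as a 16x1 ROM.

    `contents` is a 16-bit integer; bit `addr` is the stored value.
    """
    if not 0 <= addr < 16:
        raise ValueError("a 4-input LUT has 16 addresses")
    return (contents >> addr) & 1


def rom32_read(contents_low: int, contents_high: int, addr: int) -> int:
    """Combine two LUTs into a 32x1 ROM.

    The fifth address bit selects which LUT is read; the lower four
    bits address the selected LUT.
    """
    if not 0 <= addr < 32:
        raise ValueError("a 32x1 ROM has 32 addresses")
    selected = contents_high if addr & 0x10 else contents_low
    return lut4_read(selected, addr & 0x0F)


# Example contents: only address 15 of the first LUT holds a 1.
print(lut4_read(0x8000, 15), lut4_read(0x8000, 0))
```

The same two LUTs can therefore serve either as two independent 16-entry tables or as one 32-entry table, which is the trade-off between logic and memory described above.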



APPENDIX- B

VIRTEX-II PRO ARCHITECTURE



B.1 Introduction

The XUP Virtex-II Pro Development System provides an advanced hardware platform

that consists of a high performance Virtex-II Pro Platform FPGA surrounded by a

comprehensive collection of peripheral components that can be used to create a complex

system and to demonstrate the capability of the Virtex-II Pro Platform FPGA[8].

Features

Figure-I shows the Virtex-II Trainer, which includes the following components and

features:

Virtex-II Pro FPGA with PowerPC 405 cores
Up to 2 GB of Double Data Rate (DDR) SDRAM
System ACE controller and Type II Compact Flash connector for FPGA configuration and data storage
Embedded Platform Cable USB configuration port
High-speed SelectMAP FPGA configuration from Platform Flash In-System Programmable Configuration PROM
Support for "Golden" and "User" FPGA configuration bitstreams
On-board 10/100 Ethernet PHY device
Silicon Serial Number for unique board identification
RS-232 DB9 serial port
Two PS/2 serial ports
Four LEDs connected to Virtex-II Pro I/O pins
Four switches connected to Virtex-II Pro I/O pins
Five push buttons connected to Virtex-II Pro I/O pins
Six expansion connectors joined to 80 Virtex-II Pro I/O pins with over-voltage protection
High-speed expansion connector joined to 40 Virtex-II Pro I/O pins that can be used differentially or single ended
AC-97 audio CODEC with audio amplifier and speaker/headphone output and line level output
Microphone and line level audio input
On-board XSGA output, up to 1600 x 1200 at 70 Hz refresh
Three Serial ATA ports, two Host ports and one Target port
Off-board expansion MGT link, with user-supplied clock
100 MHz system clock, 75 MHz SATA clock
Provision for user-supplied clock
On-board power supplies
Power-on reset circuitry
PowerPC 405 reset circuitry

Block Diagram

Figure B.1: XUP Virtex-II Pro Development System Block Diagram[8]



Figure B.2: XUP Virtex-II Pro Development System Board Photo[8]



B.2 Virtex-II Pro FPGA:

U1 is a Virtex-II Pro FPGA device packaged in a flip-chip-fine-pitch FF896 BGA

package. Two different capacity FPGAs can be used on the XUP Virtex-II Pro

Development System with no change in functionality. Table B-1 lists the Virtex-II Pro

device features.

Features                     XC2VP20    XC2VP30
Slices                       9280       13696
Array Size                   56x46      80x46
Distributed RAM              290Kb      428Kb
Multiplier Blocks            88         136
Block RAMs                   1584Kb     2448Kb
DCMs                         8          8
PowerPC RISC Cores           2          2
Multi-Gigabit Transceivers   8          8

Table B-1: XC2VP20 and XC2VP30 Device Features

Power Supplies and FPGA Configuration

The XUP Virtex-II Pro Development System is powered from a 5V regulated power supply. On-board switching power supplies generate 3.3V, 2.5V, and 1.5V for the FPGA and peripheral components, and linear regulators power the MGTs. The board has provisioning for current measurement for all of the FPGA digital power supplies, as well as application of external power if the capacity of the on-board switching power supplies is exceeded.

The XUP Virtex-II Pro Development System provides several methods for the configuration of the Virtex-II Pro FPGA. The configuration data can originate from the internal Platform Flash PROM (two potential configurations), the internal CompactFlash storage media (eight potential configurations), or external configurations delivered from the embedded Platform Cable USB or parallel port interface.



Truth table of LUT3

I1   I2   I0   O
0    0    0    0
0    0    1    0
0    1    0    1
0    1    1    0
1    0    0    0
1    0    1    1
1    1    0    1
1    1    1    1

Table B.2: Truth table of LUT3

Figure B.3: Internal structure of a basic LUT3[9]

Figure B.4: Karnaugh Map for LUT3[9]
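The LUT3 truth table can be packed into a single 8-bit value and read back by indexing, which is how a look-up table realizes a logic function. The Python sketch below assumes that the leftmost input column of the truth table is the most significant address bit and that row i is stored in bit i; both conventions, and the function names, are assumptions made for illustration.

```python
# Output column of the LUT3 truth table, in row order (inputs 000 to 111).
OUTPUTS = [0, 0, 1, 0, 0, 1, 1, 1]


def pack_contents(outputs):
    """Pack the output column into one integer, with row i stored in bit i."""
    return sum(bit << i for i, bit in enumerate(outputs))


def lut3(contents, a, b, c):
    """Look up the output for one input pattern.

    `a` is the leftmost truth-table column, `c` the rightmost.
    """
    row = (a << 2) | (b << 1) | c
    return (contents >> row) & 1


CONTENTS = pack_contents(OUTPUTS)
print(hex(CONTENTS))

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, "->", lut3(CONTENTS, a, b, c))
```

Under this reading of the columns, the table behaves like a 2:1 multiplexer: the rightmost input selects between the leftmost input (when it is 1) and the middle input (when it is 0).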



Figure B.5: I/O Connections to Peripheral Devices[8]

Multi-Gigabit Transceivers

Four of the eight Multi-Gigabit Transceivers (MGTs) that are present in the

Virtex-II Pro FPGA are brought out to connectors and can be utilized by the user. Three

of the bidirectional MGT channels are terminated at Serial Advanced Technology

Attachment (SATA) connectors and the fourth channel terminates at user-supplied Sub-

Miniature A (SMA) connectors. The MGT transceivers are equipped with a 75 MHz

clock source that is independent of the system clock to support standard SATA

communication. An additional MGT clock source is available through a differential user-

supplied (SMA) connector pair. Two of the ports with SATA connectors are configured

as Host ports and the third SATA port is configured as a Target port to allow for simple

board-to-board networking. [8]



Figure B.6: SMA-based MGT Connections

Signal MGT Location PAD Name I/O Pin Notes

SATA_PORT0_TXN MGT_X0Y1 TXNPAD4 A27 HOST

SATA_PORT0_TXP MGT_X0Y1 TXPPAD4 A26 —

SATA_PORT0_RXN MGT_X0Y1 RXNPAD4 A24 —

SATA_PORT0_RXP MGT_X0Y1 RXPPAD4 A25 —

SATA_PORT0_IDLE — — B15 —

SATA_PORT1_TXN MGT_X1Y1 TXNPAD6 A20 TARGET




SATA_PORT1_IDLE — — AK3 —

SATA_PORT2_TXN MGT_X2Y1 TXNPAD7 A14 HOST




SATA_PORT2_IDLE — — C15 —

MGT_TXN MGT_X3Y1 TXNPAD9 A7 USER

MGT_TXP MGT_X3Y1 TXPPAD9 A6 —

MGT_RXN MGT_X3Y1 RXNPAD9 A4 —

MGT_RXP MGT_X3Y1 RXPPAD9 A5 —

MGT_CLK_N — — G16 BREFCLK

MGT_CLK_P — — F16 —

EXTERNAL_CLOCK_N — — F15 BREFCLK2

EXTERNAL_CLOCK_P — — G15 —

Table B.3: SATA and MGT Signals



System RAM

The XUP Virtex-II Pro Development System has provision for the installation of

user supplied JEDEC-standard 184-pin dual in-line Double Data Rate Synchronous

Dynamic RAM memory module. The board supports buffered and unbuffered memory

modules with a capacity of 2 GB or less in either 64-bit or 72-bit organizations. The 72-

bit organization should be used if ECC error detection and correction is required.

System ACE Compact Flash Controller

The System Advanced Configuration Environment (System ACE) Controller

manages FPGA configuration data. The controller provides an intelligent interface

between an FPGA target chain and various supported configuration sources. The

controller has several ports: the Compact Flash port, the Configuration JTAG port, the

Microprocessor (MPU) port and the Test JTAG port. The XUP Virtex-II Pro

Development System supports a single System ACE Controller. The Configuration JTAG

ports connect to the FPGA and front expansion connectors. The Test JTAG port connects

to the JTAG port header and USB2 interface CPLD, and the MPU ports connect directly

to the FPGA. [8]

Serial Ports

The XUP Virtex-II Pro Development System provides three serial ports: a single

RS-232 port and two PS/2 ports. The RS-232 port is configured as a DCE with hardware

handshake using a standard DB-9 serial connector. This connector is typically used for

communications with a host computer using a standard 9-pin serial cable connected to a

COM port. The two PS/2 ports could be used to attach a keyboard and mouse to the XUP

Virtex-II Pro Development System. All of the serial ports are equipped with level-shifting

circuits, because the Virtex-II Pro FPGAs cannot interface directly to the voltage levels

required by RS-232 or PS/2.

User LEDs, Switches, and Push Buttons

A total of four LEDs are provided for user-defined purposes. When the FPGA

drives a logic 0, the corresponding LED turns on. A single four-position DIP switch and

five push buttons are provided for user input. If the DIP switch is up, closed, or on, or the

push button is pressed, a logic 0 is seen by the FPGA, otherwise a logic 1 is indicated. [8]



Table B.4: System Configuration Status LEDs

Expansion Connectors

A total of 80 Virtex-II Pro I/O pins are brought out to four user-supplied 60-pin

headers and two 40-pin right angle connectors for user-defined use. The 60-pin headers

are designed to accept ribbon-cable connectors, with every second signal a ground for

signal integrity. Some of these signals are shared with the front-mounted right-angle

connectors. The front-mounted connectors support Digilent expansion modules. In

addition, a high-speed connector is provided to support Digilent high-speed expansion

modules. This connector provides 40 single-ended or differential I/O signals in addition

to three clocks. [8]

XSGA Output

The XUP Virtex-II Pro Development System includes a video DAC and 15-pin

high-density D-sub connector to support XSGA output. The video DAC can operate with a

pixel clock of up to 180 MHz. This allows for a VESA-compatible output of 1280 x 1024

at 75 Hz refresh and a maximum resolution of 1600 x 1200 at 70 Hz refresh[8].



DCM and XSGA Controller Settings for Various XSGA Formats

Table B.5: DCM and XSGA Controller settings for various XSGA Formats

USB 2 Programming Interface

The XUP Virtex-II Pro Development System includes an embedded USB 2.0

microcontroller capable of communications with either high-speed (480 Mb/s) or

full-speed (12 Mb/s) USB hosts. This interface is used for programming or configuring the

Virtex-II Pro FPGA in Boundary-Scan (IEEE 1149.1/IEEE 1532) mode. Target clock

speeds are selectable from 750 kHz to 24 MHz. The USB 2.0 microcontroller attaches to

a desktop or laptop PC with an off-the-shelf high-speed A-B USB cable[8].



Table B.6: XSGA Output Connections

Using the CPU Debug Port and CPU Reset

The CPU Debug port (J36) is a right angle header that provides connections to the

debugging resources of the PowerPC 405 CPU core[8].

The PowerPC 405 CPU cores include dedicated debug resources that support a variety of

debug modes for debugging during hardware and software development. These debug

resources include:

Internal debug mode for use by ROM monitors and software debuggers
External debug mode for use by JTAG debuggers
Debug wait mode, which allows the servicing of interrupts while the processor appears to be stopped
Real-time trace mode, which supports event triggering for real-time tracing

Debug modes and events are controlled using debug registers in the processor. The debug registers are accessed either through software running on the processor or through the JTAG port. The debug modes, events, controls, and interfaces provide a powerful combination of debug resources for hardware and software development tools.

The JTAG port interface supports the attachment of external debug tools, such as

the ChipScope Integrated Logic Analyzer, a powerful tool providing logic

analyzer capabilities for signals inside an FPGA, without the need for expensive external

instrumentation. Using the JTAG test access port, a debug tool can single-step the

processor and examine the internal processor state to facilitate software debugging. This

capability complies with standard JTAG hardware for boundary scan system testing.

External debug mode can be used to alter normal program execution. It provides the

ability to debug system hardware as well as software. The mode supports multiple

functions: starting and stopping the processor, single-stepping instruction execution,

setting breakpoints, as well as monitoring processor status. Access to processor resources

is provided through the CPU Debug Port.

The PPC405 JTAG Debug Port supports the four required JTAG signals:

CPU_TCK, CPU_TMS, CPU_TDO, and CPU_TDI. It also implements the optional

CPU_TRST signal. The frequency of the JTAG clock signal, CPU_TCK, can range from

0 MHz up to one-half of the processor clock frequency. The JTAG debug port logic is

reset at the same time the system is reset, using the CPU_TRST signal. When

CPU_TRST is asserted, the JTAG TAP controller returns to the test-logic reset state.

Figure B.7: CPU Debug Connector Pinouts



Figure B.7 shows the pinout of the header used to debug the operation of software in the

CPU. This is accomplished using debug tools, such as the Xilinx Parallel Cable IV or

third party tools. The JTAG debug resources are not hardwired to specific pins and are

available for attachment in the FPGA fabric, making it possible to route these signals to

whichever FPGA pins the user prefers to use. The signal-pin connections used on the

XUP Virtex-II Pro Development System are identified in Table B.7 along with the

recommended I/O characteristics. Level shifting circuitry is provided for all signals to

convert from the 3.3V levels at the connector to the 2.5V levels at the FPGA.[8]

Table B.7: CPU Debug Port Connections and CPU Reset

The RESET_RELOAD pushbutton (SW1) provides two different functions

depending on how long the switch is depressed. If the switch is activated for more than 2

seconds, the XUP Virtex-II Pro Development System undergoes a complete reset and

reloads the selected configuration. If, however, the switch is activated for less than 2

seconds, a processor reset pulse of 100 microseconds is applied to the

PROCESSOR_RESET_Z signal.[8]

Configuring the FPGA:

At power up, or when the RESET_RELOAD push button (SW1) is pressed for

longer than 2 seconds, the FPGA begins to configure. The two configuration

methods supported, JTAG and master SelectMAP, are determined by the

CONFIG SOURCE switch, the most significant switch (left side) of SW9.

If the CONFIG SOURCE switch is closed, on, or up, a high-speed SelectMap

byte-wide configuration from the on-board Platform Flash configuration PROM

(U3) is selected as the configuration source. This is identified to the user through

the illumination of the PROM CONFIG LED (D19).



The Platform Flash configuration PROM supports two different FPGA

configurations (versions) selected by the position of the PROM VERSION switch,

the least significant switch (right side) of SW9.

If the PROM VERSION switch is closed, on, or up, the GOLDEN configuration

from the onboard Platform Flash configuration PROM is selected as the

configuration data. This is identified to the user through the illumination of the

GOLDEN CONFIG LED (D14). This configuration can be a board test utility

provided by Xilinx, or another safe default configuration. It is important to note

that the PROM VERSION switch is only sampled on board powerup and after a

complete system reset. This means that if this switch is changed after board

powerup, the RESET_RELOAD pushbutton (SW1) must be pressed for more than

2 seconds for the new state of the switch to be recognized.

If the PROM VERSION switch is open, off, or down, a User configuration from

the on-board Platform Flash configuration PROM is selected as the configuration

data. This configuration must be programmed into the Platform Flash PROM from

the JTAG Platform Cable USB interface or the USB interface.

The Platform Flash is normally disabled after the FPGA is finished configuring

and has asserted the DONE signal. If additional data is made available to the

FPGA after the completion of configuration, jumper JP9 must be moved from the

NORMAL to the EXTENDED position to permanently enable the PROM and

allow the FPGA to clock out the additional data using the FPGA_PROM_CLOCK

signal.

If the CONFIG SOURCE switch is open, off, or down, a lower speed JTAG-based

configuration from Compact Flash or external JTAG source is selected as the

configuration source. This is identified to the user through the illumination of the

JTAG CONFIG LED (D20).

The JTAG-based configuration can originate from several sources: the Compact

Flash card, a PC4 cable connection through J27, and a USB to PC connection

through J8, the embedded Platform Cable USB interface.

If a JTAG-based configuration is selected, the default source is from the Compact

Flash port (J7). The System ACE controller checks the associated Compact Flash

socket and storage device for the existence of configuration data. If configuration

data exists on the storage device, the storage device becomes the source for the

configuration data. The file structure on the Compact Flash storage device



supports up to eight different configuration data files, selected by the triple CF

CONFIG SELECT DIP switch (SW8).
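
Three two-position switches can encode 2^3 = 8 values, which is why the triple DIP switch can address up to eight configuration files. The short Python sketch below illustrates only that arithmetic. The function name cf_config_index, the argument names, and the bit order and polarity (True contributes a 1 bit, first argument most significant) are assumptions made for illustration; the actual switch-to-file mapping of SW8 is not stated here and should be taken from the board documentation.

```python
from itertools import product


def cf_config_index(sw2: bool, sw1: bool, sw0: bool) -> int:
    """Map three switch states to a file index in the range 0..7.

    Assumption (hypothetical): True contributes a 1 bit and sw2 is the
    most significant bit. The real polarity of SW8 may differ.
    """
    return (int(sw2) << 2) | (int(sw1) << 1) | int(sw0)


# Enumerate every switch combination to confirm eight distinct selections.
all_indices = sorted(cf_config_index(*bits) for bits in product((False, True), repeat=3))
print(all_indices)
```

Under these assumed conventions, each of the eight switch combinations selects a different configuration file index.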

During JTAG configuration, the SYSTEMACE STATUS LED (D12) flashes until

the configuration process is completed, and the FPGA asserts the FPGA_DONE

signal and illuminates the DONE LED (D4). At any time, the RESET_RELOAD

pushbutton (SW1) can be used to load any of the eight different configuration data

files by pressing the switch for more than 2 seconds.

If a JTAG-based configuration is selected and a valid configuration file is not

found on the Compact Flash card by the System ACE controller (U2), the

SYSTEMACE ERROR LED (D11) flashes, and the System ACE controller

connects to an external JTAG port for FPGA configuration.

The default external source for FPGA configuration is the high-speed embedded

Platform Cable USB configuration port (J8) and is enabled when the System ACE

controller does not find configuration data on the storage device.

If a USB-equipped host PC is not available as a configuration source, then a

Parallel Cable 4 (PC4) interface can be used instead by connecting a PC4 cable to

J27.
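
The fallback order described above for a JTAG-based configuration can be summarized as a small decision function. This is only a sketch of the selection order given in the text, not a model of the System ACE controller itself; the function name jtag_config_source and its two boolean parameters are hypothetical.

```python
def jtag_config_source(cf_has_valid_config: bool, usb_host_available: bool) -> str:
    """Return the JTAG configuration source following the order in the text.

    Both parameters are illustrative assumptions:
    cf_has_valid_config - a valid configuration file exists on the Compact Flash card
    usb_host_available  - a USB-equipped host PC is connected to J8
    """
    if cf_has_valid_config:
        # Default source: the Compact Flash port checked by the System ACE controller.
        return "Compact Flash (J7)"
    if usb_host_available:
        # Default external source when no configuration data is found on the card.
        return "Platform Cable USB (J8)"
    # Alternative external source when no USB-equipped host PC is available.
    return "Parallel Cable 4 (J27)"


for cf, usb in [(True, True), (False, True), (False, False)]:
    print(cf, usb, "->", jtag_config_source(cf, usb))
```

The Compact Flash card takes priority whenever it holds valid configuration data; the external JTAG ports are used only when it does not.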

If the Platform Flash configuration PROM is enabled, the FPGA Start-Up Clock should

be set to CCLK in the Startup Options section of the Process Options for the generation

of the programming file; otherwise, JTAG Clock should be selected [8].

Figure B.8: Configuration data path



Table B.8: System Configuration Status LEDs

Four status LEDs show the configuration state of the XUP Virtex-II Pro Development

System at all times. From the status LEDs listed in Table B.8, the user can identify the

configuration source and configuration version and tell when configuration has completed.



References

[1] Rafael C. Gonzalez (University of Tennessee) and Richard E. Woods (MedData

Interactive), Digital Image Processing, 3rd edition, Pearson Prentice Hall, 2009.

[2] Sonja Grgic, Mislav Grgic, and Branka Zovko-Cihlar, "Performance Analysis of

Image Compression Using Wavelets," IEEE Transactions on Industrial Electronics,

vol. 48, no. 3, June 2001.

[3] JPEG official website, www.jpeg.org/jpeg2000.html

[4] Sonja Grgic, Mislav Grgic, and Branka Zovko-Cihlar, "Performance Analysis of

Image Compression Using Wavelets," IEEE Transactions on Industrial Electronics,

vol. 48, no. 3, June 2001.

[5] Xixin Cao and Qingqing Xie (School of Software and Microelectronics, Peking

University, Beijing, China), "An Efficient VLSI Implementation of Distributed

Architecture for DWT."

[6] MATLAB support for image compression,

http://www.mathworks.nl/matlabcentral/fileexchange/4772

[7] http://www.support.xilinx.com/support/techsup/tutorials

[8] Virtex-II Pro Datasheet,

http://www.xilinx.com/support/documentation/virtex-ii_pro_data_sheets.htm

[9] Xilinx XST software toolbar help
