Evaluation and Hardware Implementation of
Real-Time Color Compression Algorithms
Master’s Thesis
Division of Electronics Systems
Department of Electrical Engineering
Linköping University
By
Ahmet Caglar
Amin Ojani
Report number: LiTH-ISY-EX--08/4265--SE
Linköping, December 2008
Evaluation and Hardware Implementation of
Real-Time Color Compression Algorithms
Master’s Thesis
Division of Electronics Systems
Department of Electrical Engineering
at Linköping Institute of Technology
By
Ahmet Caglar
Amin Ojani
LiTH-ISY-EX--08/4265--SE
Supervisor: Henrik Ohlsson,
Ericsson Mobile Platforms (EMP)
Examiner: Oscar Gustafsson,
Electronics Systems, Linköping University
Linköping, December 2008
Presentation Date 2008-12-16
Publishing Date (Electronic version)
Department and Division
Department of Electrical Engineering Division of Electronic Systems
URL, Electronic Version http://www.ep.liu.se
Publication Title Evaluation and Hardware Implementation of Real-Time Color Compression Algorithms
Author(s) Amin Ojani, Ahmet Caglar
Abstract A major bottleneck, for performance as well as power consumption, for graphics hardware in mobile devices is the amount of data that needs to be transferred to and from memory. In, for example, hardware accelerated 3D graphics, a large part of the memory accesses are due to large and frequent color buffer data transfers. In a graphics hardware block, color data is typically processed using the RGB color format. For both 3D graphics rasterization and image composition, several pixels need to be read from and written to memory to generate a single pixel in the frame buffer. This generates a lot of data traffic on the memory interfaces, which impacts both performance and power consumption. Therefore, it is important to minimize the amount of color buffer data. One way of reducing the required memory bandwidth is to compress the color data before writing it to memory and decompress it before using it in the graphics hardware block. This compression/decompression must be done “on-the-fly”, i.e., it has to be very fast so that the hardware accelerator does not have to wait for data. In this thesis, we investigated several exact (lossless) color compression algorithms from a hardware implementation point of view, to be used in high-throughput hardware. Our study shows that the compression/decompression datapath is implementable even under stringent area and throughput constraints. However, the memory interfacing of these blocks is more critical and could dominate the overall cost.
Keywords Graphics Hardware, Color Compression, Image Compression, Mobile Graphics, Compression Ratio, Frame Buffer Compression, Lossless Compression, Golomb-Rice coding.
Language: English
Number of Pages 88
Type of Publication: Degree thesis
ISRN: LiTH-ISY-EX--08/4265--SE
Abstract
A major bottleneck, for performance as well as power consumption, for graphics hardware in mobile devices is the amount of data that needs to be transferred to and from memory. In, for example, hardware accelerated 3D graphics, a large part of the memory accesses are due to large and frequent color buffer data transfers. In a graphics hardware block, color data is typically processed using the RGB color format. For both 3D graphics rasterization and image composition, several pixels need to be read from and written to memory to generate a single pixel in the frame buffer. This generates a lot of data traffic on the memory interfaces, which impacts both performance and power consumption. Therefore, it is important to minimize the amount of color buffer data. One way of reducing the required memory bandwidth is to compress the color data before writing it to memory and decompress it before using it in the graphics hardware block. This compression/decompression must be done “on-the-fly”, i.e., it has to be very fast so that the hardware accelerator does not have to wait for data. In this thesis, we investigated several exact (lossless) color compression algorithms from a hardware implementation point of view, to be used in high-throughput hardware. Our study shows that the compression/decompression datapath is implementable even under stringent area and throughput constraints. However, the memory interfacing of these blocks is more critical and could dominate the overall cost.

Keywords: Graphics Hardware, Color Compression, Image Compression, Mobile Graphics, Compression Ratio, Frame Buffer Compression, Lossless Compression, Golomb-Rice coding.
1.1 Color Buffer and Graphics Hardware
1.2 Color Buffer Compression vs. Image Compression
1.3 Structure of the Report
Appendix A
Appendix B
Test Image Sets
B.1 Standard Photographic Test Images
B.2 Computer Generated Test Scenes
B.3 Computer Generated User Menu Scenes
Table of Figures
Figure 1: Compressor/decompressor hardware on memory interface
Figure 2: Error accumulation due to tandem compression
Figure 3: Compression/decompression functional blocks
Figure 4: Color transform / reverse color transform block interface
Figure 5: Color transform / reverse color transform operation flow graph
Figure 6: Median edge detector (MED) predictor prediction window
Figure 7: Predictor / constructor block interface
Figure 8: Predictor / constructor operation flow graph
Figure 9: Encoded data in the stream
Figure 10: Encoded data for (2, 0, 13, 3) and k = 2
Figure 11: Golomb-Rice encoder functional blocks
Figure 12: Golomb-Rice parameter exhaustive search hardware
Figure 13: A possible Golomb-Rice encoder hardware
Figure 14: A possible Golomb-Rice decoder hardware
Figure 15: HW-cost vs. number of input samples (n)
Figure 16: HW-cost vs. number of parameters (k)
Figure 17: HW implementation of the new combined method
Figure 18: Illustration of modular reduction
Figure 19: CALIC GAP prediction window
Figure 20: Compressor block
Figure 21: Memory mapping and corresponding pixels of the image
Figure 22: Traversal in prediction window
Figure 23: Address generator I interface
Figure 24: Address generator I hardware diagram
Figure 25: Color transform hardware diagram
Figure 26: Prediction register file controller interface
Figure 27: Change of prediction window for pixels of one subtile
Figure 28: States and register input connectivity in prediction register file controller
Figure 29: MED prediction hardware for both predictor and constructor
Figure 30: Predictor block hardware diagram
Figure 31: Encoder register file controller block interface
Figure 32: Golomb-Rice encoder block diagram
Figure 33: k-parameter estimation hardware
Figure 34: Golomb-Rice encoder realization
Figure 35: P3 block, basic hardware realization
Figure 36: Packed data order format in the memory
Figure 37: Data packer
Figure 38: Destination memory address generator block interface
Figure 39: Control path block interface
Figure 40: Overall compressor
Figure 41: Decompressor block
Figure 42: Source memory address generator block interface
Figure 43: Reverse color transform hardware diagram
Figure 44: Construction register file controller interface
Figure 45: States and register input connectivity in construction register file controller
Figure 46: Constructor block hardware diagram
Figure 47: Decoder register file controller block interface
Figure 48: Golomb-Rice decoder hardware
Figure 49: Data unpacker interface and block diagram
Figure 50: Read/write addresses from/to destination memory to construct one subtile
Figure 51: Actual addressing scheme for destination memory addresses
Figure 52: Destination memory address generator block interface
Figure 53: Overall decompressor
Figure 54: Verification framework FSM
Figure 55: Functional verification framework
Figure 56: One block of n values
Figure 57: Overlap regions of consecutive length functions with respect to ET
Figure 58: Overlap regions between length functions L1, L2, L3, L4
Figure 59: Overlap regions for n = 4 and k = {0, 1, 2, 3, 4, 5, 6} with respect to ET
Figure 60: Required comparisons of overlap regions for n = 4, k = {0, 1, 2, 3, 4, 5, 6} based on ET
Figure 61: Overlap regions of non-consecutive length functions with respect to ET
Figure 62: Motivation behind remainder-based correction
List of Tables
Table 1: Encoded output lengths for each k-parameter
Table 2: Logic cost of functional blocks
Table 3: HW cost comparison of exhaustive search and new combined method
Table 4: Estimation intervals according to sum of inputs
Table 5: HW cost and compression ratio of estimation method
Table 6: Comparison of compression performances
Table 7: Compressor block interface port description
Table 8: Source memory address generator addressing scheme
Table 9: Estimation function
Table 10: Header format generated by gr_ctrl block
Table 11: Decompressor block interface port description
Table 12: Destination memory address generator addressing scheme
Table 13: Compressor synthesis result
Table 14: Decompressor synthesis result
Table 15: Characteristics of different hardware implementations
Table 16: Comparison of cost estimations and actual sizes for blocks
Chapter 1
1 Introduction
A major bottleneck, for performance as well as power consumption, for graphics hardware in
mobile devices is the amount of data that needs to be transferred to and from memory. In, for
example, hardware accelerated 3D graphics, a large part of the memory accesses are due to large
and frequent color buffer data transfers. Therefore, it is important to minimize the amount of color buffer data.
In a graphics hardware block (for example, image composition or 3D graphics rasterization), color data is typically processed using the RGB color format. Depending on the color resolution of the image, 8, 12, 16, or 32 bits could be used to represent one pixel. For both 3D graphics rasterization and image composition, several pixels need to be read from and written to memory to generate a single pixel in the frame buffer. This generates a lot of data traffic on the memory interfaces, which impacts both performance and power consumption.
One way of reducing the required memory bandwidth is to compress the color data before writing it to memory and decompress it before using it in the graphics hardware block. Figure 1 shows the location of the compressor/decompressor hardware with respect to the graphics hardware block and the memory. The compressor/decompressor hardware helps reduce the data traffic on the memory interface, shown with arrows in the figure. The reduction in memory bandwidth can be used to minimize power consumption (fewer accesses to the memory bus), to increase performance (more data traffic over the same memory bandwidth), or a combination of both. Hence, a better trade-off between power and performance can be found depending on the design constraints.
Figure 1: Compressor/Decompressor hardware on memory interface
Hardware implementation of such a compressor/decompressor is the subject of this work. Our thesis, based on a reference color buffer compression algorithm [1], aims at:
− evaluation of color buffer compression algorithms with respect to their hardware implementation properties,
− VHDL implementation of a selected algorithm in order to validate the hardware cost estimations.
Accordingly, the thesis has been carried out in two phases. In the first phase, the following tasks were carried out:
− analysis of the problem and modeling of the reference algorithm,
− evaluation of the proposed solution with respect to both compression performance and implementation properties,
− exploration of algorithmic and hardware optimizations to improve both compression performance and implementation cost,
− selection of the final algorithm to be implemented.
The second phase of the thesis work was dedicated to the hardware implementation in VHDL and verification of the algorithm decided upon in the first phase, and to the completion of this report.
1.1 Color buffer and graphics hardware
The color buffer refers to the portion of memory where the actual pixel data to be sent to the display is stored. Graphics hardware uses this buffer during rasterization. Depending on the rasterizer architecture, this buffer can be accessed in different ways. In traditional immediate-mode rendering, each triangle is rendered as soon as it arrives. Hence, for every triangle that is drawn, the related pixel data are written to the buffer unless the triangle is completely hidden. In tiled, deferred rendering architectures, on the other hand, the color buffer is written only when a complete tile (a unit of w × h pixels) is finished. Hence, only visible color writes are performed, which reduces the overall color buffer bandwidth. A more detailed explanation of the topic can be found in [2].
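The bandwidth difference between the two rendering styles can be illustrated with a toy example (hypothetical pixel sets, not taken from the thesis): immediate-mode rendering writes the color buffer once per covered pixel per triangle, while a tiled, deferred architecture resolves each tile with at most one write per visible pixel.

```python
# Toy model: two overlapping "triangles" represented as sets of covered pixels.
tri_a = {(x, y) for x in range(0, 4) for y in range(0, 4)}  # covers a 4x4 block
tri_b = {(x, y) for x in range(2, 6) for y in range(2, 6)}  # overlaps tri_a

# Immediate mode: every draw writes all of its covered pixels.
immediate_writes = len(tri_a) + len(tri_b)

# Tiled, deferred mode: the tile is resolved once, one write per visible pixel.
deferred_writes = len(tri_a | tri_b)

print(immediate_writes, deferred_writes)  # 32 vs 28
```

With heavier overdraw (many triangles covering the same pixels), the gap between the two write counts grows accordingly.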
1.2 Color buffer compression vs. image compression
Color buffer data compression, as a specific application of general data compression, shares many similarities with image compression. Consequently, the theory developed for image compression is well suited for compressing color buffer data in 3D graphics hardware. Specifically, the correlation between neighboring pixel values also holds for color buffer data and can be used as a basis for compression.
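As a small illustration of that correlation (hypothetical pixel values, not taken from the thesis), differencing a smooth row of pixels yields residuals clustered near zero, which are much cheaper to encode than the raw values:

```python
row = [118, 120, 121, 121, 124, 126, 125, 127]  # hypothetical smooth scanline

# Previous-pixel (DPCM-style) prediction: residual = current - previous;
# the first sample is passed through unchanged.
residuals = [row[0]] + [cur - prev for prev, cur in zip(row, row[1:])]

print(residuals)  # [118, 2, 1, 0, 3, 2, -1, 2]
```

Apart from the first sample, every residual fits in a few bits, which is exactly what variable-length coders such as Golomb-Rice exploit.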
On the other hand, there are important differences between color buffer data compression and image compression. First of all, most image compression algorithms in the literature have been developed for continuous-tone still images. Their compression results have customarily been reported on some set of well-known test images. Those images are real (photographic) images, and it is harder to get information about the performance of image compression algorithms on computer-generated images. Secondly, and more importantly, most image compression algorithms assume the availability of a whole, completed image. For example, most (if not all) state-of-the-art image compression algorithms are adaptive, which can be briefly described as learning from the image itself while traversing it in some order. Rasterization in graphics hardware, on the other hand, is an incremental process. Depending on the rasterizer architecture, the data to be compressed could be an unfinished scene, and it could also be only a part of the whole scene. In a tiled architecture, for example, a tile is the unit of data to be compressed, and the tile size could be too small to learn from. Hence, the success of adaptive image compression algorithms on color buffer data is not obvious and depends on the specific rasterizer architecture.
Another difference between our framework and image compression algorithms lies in the requirements on complexity and implementation cost. As mentioned in [1], most image compression algorithms are not symmetric, i.e., compression and decompression take different times. Moreover, for most compression algorithms only the complexity of the forward path (compression) is discussed, since they target applications where only compression and storage of the image data matter; the backward path (decompression) is not considered critical. In our case, however, the compression/decompression must be done “on-the-fly”, i.e., it has to be very fast so that the hardware accelerator does not have to wait for data. Finally, a compressor/decompressor for mobile devices has extra requirements on implementation cost. Specifically, the size of the hardware block is of prime concern. This prohibits the use of sophisticated algorithms whose logic and storage (buffering) costs exceed what is affordable in our case.
1.3 Structure of the report
Chapter 1 of the report has given a description of the aim of this thesis work and some background information about the application area. Chapter 2, starting with an explanation of the need for lossless compression in our case, gives a thorough analysis of the lossless compression algorithms considered for this thesis and an evaluation of their implementation properties. This chapter corresponds to the first phase of our thesis work. Chapter 3 describes the implementation and hardware of the compressor/decompressor and presents synthesis results. Chapter 4 includes concluding remarks and a discussion of possible future work.
Chapter 2
2 Lossless Compression Algorithms
In this chapter, we discuss several lossless color data compression algorithms and their compression performance together with their hardware implementation properties. Later, we propose a modified algorithm which is especially effective for compressible images. The chapter ends with a comparison of the compression ratio and cost of these algorithms, and some remarks about possible future improvements.
2.1 Introduction
Lossless image compression is customarily used in specific application areas such as medical and astronomical imaging, preservation of artwork, and professional photography. It is not surprising that lossless compression is not used for multimedia in general, considering its limited compression performance. The achievable compression ratio generally varies between 2:1 and 3:1, which is significantly lower than what lossy compression can offer. Furthermore, in lossy compression the resulting image quality and the desired compression performance can always be traded off against each other depending on the requirements.
Considering the disadvantages just mentioned, one may object to the use of lossless compression for color buffer data in 3D graphics hardware. However, [1] explains and illustrates the possibility of unbounded errors due to so-called tandem compression when a lossy algorithm is used. Tandem compression artifacts arise when lossy compression is performed for every triangle written to a tile during rasterization, resulting in an accumulation of error. This is a direct consequence of rasterization being an incremental process. Figure 2, from [1], illustrates this accumulation of error.
Figure 2: Error accumulation due to Tandem Compression
Although it is possible to control the accumulated error in those cases, as suggested in [1], the resulting image quality may not be acceptable. In our work we instead employ a conservative approach (lossless compression), since the resulting compression ratio is sufficient for our application.
2.2 Theoretical Background of Lossless Image Compression
In image compression, there are several algorithms offering different approaches to the compression of still images. The most famous are FELICS [3], LOCO-I [4] and CALIC [5]. Owing to its better trade-off between complexity and compression ratio, LOCO-I was standardized as JPEG-LS [6].
2.2.1 JPEG-LS Algorithm
The idea behind JPEG-LS is to take advantage of both the simplicity and the compression potential of context models. The error residuals are computed using an adaptive predictor, and the Golomb-Rice technique is used for encoding the data. The purpose of having an adaptive predictor instead of a fixed one is that it yields smaller prediction residuals, which leads to a higher compression ratio. Note that better prediction pays off only when the coding parameters can be derived from the compressed stream itself, which is the case in JPEG-LS. Otherwise, the major overhead degrading the compression ratio is the transmitted header information, and in that case improving the predictor cannot help much in getting higher performance. The reason why non-adaptive algorithms give a lower compression ratio is that their compression performance is limited by the first-order entropy of the prediction residuals, which in general cannot achieve total decorrelation of the data [6]. As a consequence, the compression gap between these simple schemes and more complex algorithms is significant.
The LOCO-I algorithm consists of three main components. The first component is the predictor, which itself has a fixed part and an adaptive part. The fixed part performs horizontal and vertical edge detection, where the dependence on the surrounding samples is through fixed coefficients. The fixed predictor used in LOCO-I is a simple median edge detector (MED) predictor and will be explained in subsection 2.3.2. The adaptive part, on the other hand, is context-dependent and performs bias cancellation, since a DC offset is typically present in context-based prediction [6].
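The MED predictor is standard in the literature, so a sketch can be given here even though the thesis defines it later (subsection 2.3.2); this follows the usual LOCO-I formulation, with a = left, b = above, and c = above-left neighbor of the current pixel:

```python
def med_predict(a, b, c):
    """Median edge detector (MED) fixed prediction, LOCO-I formulation."""
    if c >= max(a, b):
        return min(a, b)  # c large: likely an edge, predict the smaller neighbor
    if c <= min(a, b):
        return max(a, b)  # c small: likely an edge, predict the larger neighbor
    return a + b - c      # smooth region: planar (gradient) prediction

print(med_predict(10, 20, 15))  # 15: smooth case, planar prediction
```

The three branches are what gives the predictor its edge-detecting behavior with only comparisons and one add/subtract, which is why it is attractive for low-cost hardware.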
The second component is the context model. A more complex context modeling technique allows a higher dependency order to be captured. In LOCO-I, the context is formed by computing the gradients of neighboring pixels and quantizing them into a small number of equally probable connected regions. Although in principle the number of those regions should be adaptively optimized, the low-complexity requirement dictates a fixed number of equally probable regions. The gradients capture information about the part of the image surrounding a sampled pixel. From the gradients we can learn the level of activity, such as smoothness or edginess, around the sampled pixel; this information governs the statistical behavior of the prediction error [6].
For JPEG-LS, the number of contexts is 365. This number represents a suitable trade-off between the storage requirement, which is proportional to the number of contexts, and the compression gain obtained from context modeling.
The last component, the coder, encodes the corrected prediction residuals. LOCO-I uses the Golomb-Rice coding technique [6, 7] in two different modes: a regular mode and a run-length mode. This coding technique is discussed in detail in subsection 2.3.3.
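The regular-mode code can be sketched as follows. This is textbook Golomb-Rice encoding; the bit convention used here (zero-run quotient terminated by a one) is an assumption and may differ from the thesis's actual stream format:

```python
def golomb_rice_encode(value, k):
    """Golomb-Rice code of a nonnegative integer: unary quotient, then k remainder bits.

    The total length is (value >> k) + 1 + k bits, so a larger k shortens
    large values at the price of a fixed k-bit tail on every sample.
    """
    q = value >> k                 # quotient, sent in unary
    r = value & ((1 << k) - 1)     # remainder, sent in k binary bits
    remainder_bits = format(r, "b").zfill(k) if k > 0 else ""
    return "0" * q + "1" + remainder_bits

print(golomb_rice_encode(13, 2))  # "000101": quotient 3 in unary, remainder 01
```

Because the length depends on both the residual magnitude and k, choosing a good k per block of residuals is the central cost issue revisited in section 2.4.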
There are several implementation approaches for the JPEG-LS algorithm, each of which uses a specific hardware architecture such as parallel, pipelined, or a combination of both. Implementation options include dedicated DSPs, FPGA boards, and ASICs. Factors that affect the choice of platform include cost, speed, memory, size, and power consumption. One very important characteristic of the JPEG-LS algorithm is its sequential execution nature, due to the use of context statistics when coding the error residuals in the prediction phase. This characteristic makes it difficult to design a parallel, pipelined encoder architecture to speed up compression. Different hardware architectures and their implementation results are discussed in section 3.6.
Compression in a mobile application is limited by the available storage and memory bandwidth. Therefore, context-based algorithms such as JPEG-LS may not be applicable, since their storage requirement for the context information could be quite high for this application.
2.3 Reference Lossless Compression Algorithm
Our thesis work is based on [1], which gives a survey of color buffer data compression algorithms and proposes a new exact (lossless) algorithm. In this section, we present a thorough analysis of this algorithm and of the role and hardware implementation cost of its functional blocks. The result of this analysis serves as the basis for our later work on both algorithmic and hardware optimizations.
This algorithm, as opposed to more complex adaptive context-modeling schemes like LOCO-I [4], can be classified as a variant of the simplicity-driven DPCM technique, employing variable-bit-length coding of prediction residuals obtained from a fixed predictor [6]. To get a better decorrelation of the pixel data, a lossless (exactly reversible) color transform precedes those blocks. The block diagram of the compressor and decompressor is given in figure 3.
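As an example of such an exactly reversible transform, the reversible color transform (RCT) of JPEG 2000 can serve as a stand-in. It is shown here only as a representative lossless transform; the transform actually used in the reference algorithm may differ:

```python
def rct_forward(r, g, b):
    """Forward reversible color transform (JPEG 2000 RCT): integer-only, exactly invertible."""
    y = (r + 2 * g + b) >> 2   # luma-like component (floor division by 4)
    cr = r - g                 # chroma difference
    cb = b - g                 # chroma difference
    return y, cr, cb

def rct_inverse(y, cr, cb):
    """Exact inverse: recover g first, then r and b from the chroma differences."""
    g = y - ((cr + cb) >> 2)   # arithmetic shift gives floor division, also for negatives
    return cr + g, g, cb + g

print(rct_inverse(*rct_forward(10, 20, 30)))  # (10, 20, 30): lossless round trip
```

The attraction for hardware is that only adders and shifts are needed, while the round trip is bit-exact, which is a prerequisite for a lossless compression pipeline.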
Functional block                  Compressor    Decompressor
Color transform                   34            -
Reverse color transform           -             34
Prediction                        232           -
Construction                      -             232
GR encoder (k determination)      310           -
GR encoder (residual encoding)    20            -
GR decoder (residual decoding)    -             -
Total                             596           266

Table 2: Logic Cost of Functional Blocks
2.4 Golomb-Rice Encoding Optimization
Considering the results given in table 2, it is obvious that the most costly part of the design is the
hardware necessary to find the best k parameter for Golomb-Rice coding. Therefore, in order to
reduce the overall hardware cost, it is natural to focus on reducing the cost of this circuitry.
Two approaches have been considered to reduce the complexity. The first is an improved
exhaustive search method, which is presented in subsection 2.4.1. The second is to use an
estimation formula given in [8], which is presented in subsection 2.4.2.
2.4.1 Proposed method for exhaustive search solution
The exhaustive search method for finding the k parameter is straightforward to implement, but
its computational cost is large and increases linearly with the number of k values.
For every k value, the length of the encoded data must be calculated, and the k corresponding to
the minimum length is then chosen by comparison. For example, consider a block size of n,
which indicates the number of inputs to be encoded together, and the set k = {0, 1,
2, …, m-1}, where m is variable and depends on the application requirements. The best member of
the set should be selected as the Golomb-Rice parameter.
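The exhaustive search described above can be sketched as follows. This is a minimal illustration, assuming non-negative integer residuals and the usual Rice code length of (e >> k) + 1 + k bits per sample (unary quotient, stop bit, and k remainder bits); the block contents and candidate set are example values, not the thesis's exact configuration.

```python
def rice_length(e: int, k: int) -> int:
    """Length in bits of the Rice code of e with parameter k:
    unary quotient (e >> k), one stop bit, and k remainder bits."""
    return (e >> k) + 1 + k

def best_k_exhaustive(block, m: int):
    """Try every k in {0, ..., m-1}; return (best_k, total_length)."""
    best = None
    for k in range(m):
        total = sum(rice_length(e, k) for e in block)
        if best is None or total < best[1]:
            best = (k, total)
    return best

# Example: a block of n = 4 six-bit residuals, candidate set {0, ..., 7}.
print(best_k_exhaustive([5, 12, 3, 9], 8))  # -> (2, 18)
```

Note that all m total lengths are computed and compared, which is exactly the linear growth in cost that the following subsections aim to remove.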
The computational requirements of the exhaustive search method can be significantly reduced with
our new solution, while still finding the best Golomb-Rice parameter k for a group of input
data. The proposed approach combines two different ideas.
The first idea, which will be referred to as “overlap-limited search”, removes the need to
compute and compare the length values for every possible k. It is mathematically
proven that for any given set of input samples {e1, e2, e3, …, en}, depending on their sum, overlap
regions exist only between a fixed, limited number of length functions, and that only those
length functions need to be computed and compared to obtain the best k. In other words, not all
possible k values, but only a fixed, limited, and consecutive subset of them, can be candidates for
the Golomb-Rice parameter of each block. This idea is not limited to hardware
implementations; it reduces the time complexity of the comparison in software implementations as
well.
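The idea can be illustrated with a small sketch. The exact region boundaries are derived in Appendix A; here, as an assumption for illustration only, the start of the candidate window is approximated from log2(eT / n) (the block sum over the block size), and a fixed window of three consecutive k values is searched. The window size and the approximation are hypothetical stand-ins for the derived boundaries.

```python
import math

def rice_length_block(block, k: int) -> int:
    """Total Rice code length of the block for parameter k."""
    return sum((e >> k) + 1 + k for e in block)

def best_k_windowed(block, m: int, window: int = 3):
    """Overlap-limited search: compare only a fixed, consecutive
    window of candidate k values, located from the block sum eT."""
    eT = sum(block)
    n = len(block)
    # Locate the candidate region from eT (approximate; Appendix A
    # gives the exact region boundaries).
    k0 = max(0, int(math.log2(eT / n))) if eT > 0 else 0
    candidates = [k for k in range(k0, k0 + window) if k < m]
    return min(((k, rice_length_block(block, k)) for k in candidates),
               key=lambda t: t[1])

print(best_k_windowed([5, 12, 3, 9], 8))  # -> (2, 18)
```

Regardless of how large the candidate set {0, …, m-1} grows, only a fixed number of lengths are computed and compared, which is the property shown later in figure 16.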
The second idea, which will be referred to as “remainder-based correction”, eliminates the
computational redundancy of performing identical bit additions when calculating the code lengths (Lk)
corresponding to each k. We identify the bit additions common to all Lk and save hardware by
performing those additions only once. Put differently, instead of adding shifted
versions of the input data (the quotients) for each k, we first add the inputs only once and then shift
the same sum for each k. This way of calculating, however, ignores the effect of the remainders on the
sum. To obtain the exact same result, a correction is performed after the addition for each k,
using the remainders of the division. Since the correction hardware is much smaller than the adders
it replaces, a significant hardware saving is possible. This idea is only applicable to hardware
implementations of finding the Golomb-Rice parameter (the best k).
To put the solution into perspective, the plots in figures 15 and 16 show the cost functions of three
different implementations: the exhaustive search, the overlap-limited search method, and the
combined method (overlap-limited search plus remainder-based correction), with respect to n
(the number of input samples) and k (the number of candidates for the Golomb-Rice parameter), respectively.
In figure 15, the cost function is plotted against n (the number of input samples to be
encoded together). It is assumed that the set k = {0, 1, 2, 3, 4, 5, 6, 7} is fixed and that the input data
word length is 8 bits. It can be observed from the plot that the slope of the cost function
of the combined method is one third of that of the exhaustive search method.
Figure 15: HW-cost vs. number of input samples (n)
In figure 16, the cost is shown as a function of the number of members in the set k. This plot shows a
very important feature of “overlap-limited search”: the number of comparisons needed to find the
best Golomb-Rice parameter k is fixed and independent of the number of k values to be
compared. Hence, for applications where the dynamic range of the input data is larger, a larger set of k
values must be used, and “overlap-limited search” leads to even more significant reductions in
the number of comparisons. Audio applications using 16-bit input data are an
example of this case [12].
The results of both figures 15 and 16 show that the combined solution is the cheaper one.
Figure 16: HW-cost vs. number of parameters (k)
The mathematical derivation and data analysis of the proposed method are given in Appendix A.
Our implementation, combining both methods, is shown in the circuit diagram in figure 17. It takes
the input bits (A5-A0, B5-B0, C5-C0, D5-D0), eT, k, k+1, and k+2 as inputs. eT is obtained by adding the input values.
The region corresponding to eT is then located to find the three candidate values (k, k+1, k+2) to compare.
The outputs of the circuit are Lk, Lk+1, and Lk+2. These three values are compared using two
comparators in a final stage to find the best Golomb-Rice parameter.
Figure 17: HW implementation of the new combined method