Efficient Coding and Mapping Algorithms For Software-only Real-Time Video Coding

at Low Bit Rates

Berna Erol, Faouzi Kossentini, and Hussein Alnuweiri

Abstract — This paper presents efficient coding and mapping algorithms that lead to a significant speed improvement in low

bit rate H.263/H.263+ video encoding while maintaining high video reproduction quality. First, by exploiting statistical properties

of low resolution and slowly varying video sequences, we reduce significantly the computation times of the most computationally

intensive components of video coding, particularly the DCT, the IDCT, quantization, and motion estimation. We also map some

of the SIMD (Single Instruction Multiple Data) oriented functions onto Intel’s MMX architecture. The developed algorithms are

implemented using our public-domain H.263/H.263+ encoder/decoder software [1].

Index Terms — Low bit rate video coding, real-time video coding, H.263, MMX.

I. INTRODUCTION

The growing interest in digital video applications led academia and industry to work towards developing video

compression algorithms. Consequently, several successful standards have emerged, such as the ITU-T H.261,

H.263, H.263+, ISO/IEC MPEG-1, and MPEG-2. H.263 and H.263+ are low bit rate video coding standards that

are based on H.261. H.263 targets video encoding at rates below 64 kbps (kilobits per second), providing some

major improvements over the H.261 standard and offering four negotiable modes that provide many tradeoffs

between compression performance and complexity [2]. H.263+ provides twelve additional negotiable modes that

improve compression performance and error resilience, enhance performance over packet-switched networks, allow

the use of scalable bit streams, and provide supplemental display and external usage capabilities [3]. While the

advanced video coding algorithms, such as those compliant with H.263 and H.263+, provide better compression

performance levels than those compliant with H.261 and MPEG-1, they are more complex and more

computationally demanding.

Until recently, real-time video encoding and decoding were only possible using Application Specific Integrated

Circuits (ASICs) or multiple DSP platforms, resulting in coding systems such as the VDSP2 from Matsushita

Electric [4], the DVxpert chips from C-Cube Microsystems [5], and the MPEGE422 encoder chips from IBM [6].

Although ASICs generally offer the best speed performance, they are limited by their inflexible hardware

structures. They also increase the system cost by taking additional board area and requiring additional wiring to the

memory and I/O subsystems. Due to the limitations of ASIC solutions and the growing interest in computationally

intensive multimedia applications, many of the general purpose processors now have multimedia extensions. These

extensions increase the speed of computations by supporting a single instruction multiple data (SIMD) mode of

execution where an operation is performed on multiple data concurrently. Examples of such architectural extensions

are the VIS extension of Sun’s Ultra Sparc [7], the MAX-2 extension of Hewlett-Packard’s PA-RISC [8] and the

MMX extension of Intel’s Pentium [9]. These enhancements greatly help make real-time video coding in software a reality.

In this paper, we propose two methods to increase the speed of video coding, making possible interactive real-

time video applications on the Pentium MMX general purpose processor. In the first method, we propose platform

independent efficient video coding algorithms that reduce the number of computations by exploiting statistical

properties of low bit rate video coding. More specifically, we obtain a significant reduction in Discrete Cosine

Transform (DCT) and quantization computations by predicting blocks where most or all quantized DCT

coefficients are equal to zero. The number of inverse DCT computations is also reduced by utilizing the zero sub-

block dominant structure of coarsely quantized blocks. We employ predefined look-up tables to eliminate the

quantization operations. Finally, we propose algorithms to reduce the computation demands of the sum of absolute

differences (SAD) and half pixel motion vector search operations. The second method employed in this paper is

processor-specific, where we map several SIMD (Single Instruction Multiple Data) oriented components of

H.263/H.263+ baseline video coding onto Intel’s MMX architecture. Such components include the DCT,

interpolation, SAD, and half pixel motion vector search. All the algorithms and techniques proposed in this paper

are implemented within our public-domain H.263/H.263+ encoder/decoder software [1]. The resulting H.263 video

encoder implementation is approximately 2 times faster than our unoptimized public-domain encoder

implementation. Moreover, the optimized video encoder can encode QCIF video sequences at 15 frames per second

on a Pentium MMX 200 MHz processor while maintaining a high video reproduction quality.

The rest of the paper is organized as follows. Section II provides an overview of the H.263 and H.263+ low bit

rate video coding standards. Section III describes the platform independent algorithms. In Section IV, a brief

introduction to Intel’s MMX architecture is given and the mapping of several computationally intensive algorithms

onto MMX and associated performance results are presented. Finally, experimental results that illustrate the

resulting speed performance tradeoffs and conclusions are given in Section V and Section VI, respectively.

II. H.263 AND H.263+ VIDEO CODING

H.263 and H.263+ are both ITU-T standards that are based on the former ITU-T H.261 standard. They employ

a hybrid video coding method in which inter picture prediction is used to reduce temporal redundancies and

transform coding of the motion compensated prediction error data is used to reduce spatial redundancies [10][11].

H.263 provides some new components and methods that improve the rate-distortion performance over H.261,

including half pixel motion vector compensation, modified variable length coding, more efficient motion vector

prediction, and modification of the quantizer for each macroblock [12]. The baseline coding of H.263+ is the same

as that of H.263 [13].

Here, we give a brief overview to the H.263/H.263+ baseline encoding process. H.263 and H.263+ standards,

like the other video coding standards, define only the bit stream syntax and the decoding process. The precise

definitions of compliant video encoding algorithms are given in the Test Models. Figure 1 shows a simplified block

diagram of the H.263/ H.263+ baseline encoder as defined in the TMN 11 [14]. Each picture in the input video

sequence is divided into macroblocks, each of which consists of four 8x8 luminance blocks and two 8x8 chrominance

blocks, Cb and Cr, as shown in Figure 2. First, an integer pixel motion vector (MV) is determined by performing

motion estimation (ME) for the current 16x16 luminance block. Because of its low computational complexity and good performance, the Sum of Absolute Difference (SAD) is the most commonly used method for determining the best

matching blocks in motion estimation. The SAD operation is defined as follows

$$\mathrm{SAD} = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\left| X_{m,n} - X^{R}_{m+i,\,n+j} \right|,$$

where N and M are the dimensions of the block, $X_{m,n}$ is the value of the block element located in row m and column n of the current picture, and $X^{R}_{m+i,n+j}$ is the value of the block element located in row m+i and column n+j in the reference (previous) picture.
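
For concreteness, a minimal C sketch of this measure for a 16x16 luminance block is given below; the function and parameter names are illustrative rather than taken from the TMN software, and the constant 1/(MN) factor is omitted since only the relative ordering of candidate blocks matters during motion estimation.

    #include <stdlib.h>

    /* Sum of absolute differences between a 16x16 block of the current
     * picture and the block displaced by (i, j) in the reference picture.
     * `stride` is the width of the picture buffers in pixels.  The 1/(MN)
     * normalization is a constant factor and is left out here. */
    static unsigned int sad_16x16(const unsigned char *cur,
                                  const unsigned char *ref,
                                  int i, int j, int stride)
    {
        unsigned int sad = 0;
        int m, n;

        for (m = 0; m < 16; m++) {
            const unsigned char *c = cur + m * stride;
            const unsigned char *r = ref + (m + i) * stride + j;
            for (n = 0; n < 16; n++)
                sad += (unsigned int)abs((int)c[n] - (int)r[n]);
        }
        return sad;
    }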

After motion compensation, if the difference between the current block and the predicted one is large, the four

luminance and the two chrominance blocks in the macroblock are intra coded by performing an 8x8 DCT,

quantization, and entropy coding using Variable Length Codes (VLCs). If the difference between the current block

and the predicted one is small, then half pixel motion search is performed around the current integer pixel motion

vector. In order to find the half pixel motion vector, bilinear interpolation, which is shown in Figure 3, is performed

on the previous picture. After motion compensation using the half pixel motion vector, the difference between the

current block and the predicted block is DCT transformed, and the resulting coefficients are quantized and VLC

coded. Information bits for the motion vectors are also added to the bit stream. As in Differential Pulse Code

Modulation (DPCM), the difference picture is decoded, and the reconstructed picture is added to the current

predicted picture in order to be used during the prediction of future pictures.
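
As a reference for the interpolation step, the bilinear filter of Figure 3 can be written directly in C; the function and argument names below are illustrative, and rtype is the rounding-type bit defined in the figure (0 unless the rounding control option is in use).

    /* Bilinear interpolation for half pixel prediction (Figure 3).
     * p11, p12, p21 and p22 are the four surrounding integer pixels;
     * a, b, c and d are the integer, horizontal, vertical and diagonal
     * half pixel values, respectively. */
    static void half_pixel_values(int p11, int p12, int p21, int p22,
                                  int rtype,
                                  int *a, int *b, int *c, int *d)
    {
        *a = p11;
        *b = (p11 + p12 + 1 - rtype) / 2;
        *c = (p11 + p21 + 1 - rtype) / 2;
        *d = (p11 + p12 + p21 + p22 + 2 - rtype) / 4;
    }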

III. EFFICIENT LOW BIT RATE VIDEO CODING ALGORITHMS

There have been many research efforts that are aimed at reducing the complexity and computation time of video

encoding and decoding. Most of these efforts have concentrated on developing efficient ways of performing motion

estimation [15]-[19] and computing the DCT/IDCT [20][21]. Moreover, Senda et al. proposed a method for fast half pixel motion estimation in [22] within an MPEG-2 coding framework. Froitzheim et al. [23], Lengwehasatit et al. [24], and McCanne et al. [25] suggested techniques to reduce the number of computations of the IDCT. Methods for reducing the DCT computation time using the sparseness of the quantized coefficients were suggested by Girod et al. [26], Yu et al. [27], and Lengwehasatit et al. [28].

As part of our H.263/H.263+ baseline encoder implementation, we employ a fast motion estimation (ME)

algorithm that is described in the H.263+ Test Model [14]. Using this algorithm, only 10-15 16x16 SAD operations

are required per macroblock, yet the resulting quality is almost the same as that of the full search algorithm [15].

We also employ the fast DCT/IDCT algorithm proposed by Arai et al. [20], which requires 144 multiplication and

464 addition operations to compute the 2-D DCT/IDCT of an 8x8 block. This algorithm performs a small number

of consecutive multiplications, which is important since the algorithm is also implemented in fixed point arithmetic.

Moreover, the algorithm has a scalable structure where 64 of the multiplications can be performed after the actual

transform, and therefore, they can be combined with the quantization operation. This feature can be exploited when

the H.263+ optional mode that employs quantization without a dead-zone is enabled [3]. Table 1 shows the relative

computational costs of some of the H.263/H.263+ baseline encoder components. These results were derived by

profiling our public-domain H.263/H.263+ software with the fast ME, fast DCT, and fast IDCT algorithms

enabled and without using rate control. As shown in Table 1, further increasing the computational efficiency of

integer ME, half pixel ME, DCT, IDCT, and quantization modules would decrease the overall encoding time

significantly. It is possible to reduce the number of computations associated with such modules by exploiting the

statistical properties of slowly varying, low resolution video sequences. In this section, we present efficient

algorithms for the computation of the DCT, IDCT, and SAD. We also present more efficient ways of performing

quantization and half pixel motion estimation.

A. Zero Block Prediction Prior to DCT

The DCT is one of the most computationally intensive components of the H.263/H.263+ baseline encoding.

Although using fast algorithms greatly reduces the number of computations needed for the DCT operation, it still

requires 15% of the encoding time when the fast search algorithm described in [15] is used for motion estimation.

In a typical H.263/H.263+ application, the motion-compensated prediction difference blocks are coarsely

quantized, resulting in zero blocks, i.e. blocks where all coefficients are equal to zero, more than 50% of the time. If

such blocks can be predicted prior to computing the DCT and quantization, then the corresponding computations

can be eliminated. In this work, four simple predictors were considered to detect the zero blocks. They are

$$F_1 = \sum_{i=0}^{7}\sum_{j=0}^{7} |X_{i,j}|, \qquad
F_2 = \sum_{i=0}^{7}\sum_{j=0}^{7} X_{i,j}^{2}, \qquad
F_3 = \sum_{i=0}^{7}\sum_{j=0}^{7} |X_{i,j} - \bar{X}|, \quad \mathrm{and} \qquad
F_4 = \sum_{i=0}^{7}\sum_{j=0}^{7} (X_{i,j} - \bar{X})^{2},$$

where F1 is the sum of absolute values, F2 is the sum of squares, F3 is the mean absolute difference, F4 is the variance, $X_{i,j}$ is the value of the pixel located in row i and column j, and $\bar{X}$ is the mean of the block. Figure 4 shows

the distribution of the zero blocks as a function of each of the above predictors. An ideal prediction function would

be the one that separates zero blocks from the non-zero blocks completely. As can be seen from the figure, the sum

of absolute values (F1) and the sum of squares (F2) are better predictors in distinguishing zero blocks from the non-

zero ones. Because of its low computation demands and slightly better performance, the sum of absolute values (F1)

is selected as the predictor. More specifically, if F1 is smaller than an experimentally determined threshold, then the

DCT and quantization operations are skipped. The threshold depends on the quantizer value. Table 2 lists the

statistically found thresholds for each quantizer. The table also shows the percentage of the blocks that actually result in zero blocks after DCT and quantization, along with the percentage of the blocks that are predicted as zero blocks. Both percentages strongly depend on the input sequence, while the threshold is very stable for a given quantizer.
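
A minimal sketch of the resulting test is shown below, assuming an 8x8 prediction error block stored as a flat array; `dct_and_quantize` stands in for the encoder's existing DCT and quantization routines, and `zero_block_threshold[]` holds the thresholds of Table 2 indexed by the quantizer value.

    #include <stdlib.h>
    #include <string.h>

    extern const int zero_block_threshold[32];             /* thresholds of Table 2 */
    extern void dct_and_quantize(const short *diff, short *coeff, int quant);

    /* Zero block prediction prior to the DCT.  Returns 1 if the block was
     * transformed and quantized, 0 if it was predicted to quantize to all
     * zeros and the DCT/quantization were skipped. */
    static int code_difference_block(const short diff[64], short coeff[64], int quant)
    {
        int f1 = 0, k;

        for (k = 0; k < 64; k++)
            f1 += abs(diff[k]);                             /* predictor F1          */

        if (f1 < zero_block_threshold[quant]) {
            memset(coeff, 0, 64 * sizeof(short));           /* treat as a zero block */
            return 0;
        }
        dct_and_quantize(diff, coeff, quant);               /* normal coding path    */
        return 1;
    }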

The predictor function F1 must be applied to each block, incurring a certain constant overhead in computations.

Also, even when a block is predicted to be zero, it may still require additional operations, such as filling in the array

of coefficients with zeros. Thus, the total number of operations required for the DCT and quantization is equal to

$$C = X(1-\rho) + Y + Z\rho,$$

where X is the number of DCT and quantization computations per block, Y is the number of computations required for predicting a zero block, Z is the number of computations required to process a zero block after it has been predicted, and $\rho$ is the fraction of blocks which are predicted to have all zero coefficients. The overall computation gain

depends strongly on the relative values of X, Y, and Z. The parameters X, Y, and Z depend, in turn, on the

implementation of each function. For example, if a system implements the DCT function in a sub-optimal way, e.g.,

where MMX instructions are not supported by the processor, and the cost of predicting zero blocks is very small,

e.g., where the processor has a dedicated instruction for this operation, then using the proposed DCT prediction

method is clearly very advantageous. On the other hand, if the DCT algorithm is implemented in a very efficient

way and runs on dedicated hardware, and predicting the zero blocks takes as much time as computing the DCT,

using the above prediction method is not very useful. Nevertheless, in general, computing the DCT takes much

more time than predicting the zero blocks. Coupled with the fact that, in low bit rate coding, more than 50% of the blocks result in zero blocks after DCT and quantization, this means that the proposed prediction method provides very significant speed improvements in most low bit rate video encoding systems.
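
As a purely illustrative instance of this model (the operation counts below are assumptions derived from figures quoted elsewhere in this paper, not measurements): taking $X \approx 608$ operations for the fast 8x8 DCT alone (144 multiplications and 464 additions), $Y = 127$ operations for evaluating $F_1$ (64 absolute values and 63 additions), a negligible $Z$, and $\rho = 0.6$, the model gives

$$C \approx 0.4 \times 608 + 127 \approx 370,$$

i.e. roughly 40% fewer operations than computing the DCT for every block, before any savings in quantization are counted.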

B. IDCT

The IDCT is performed both at the encoder and the decoder. If the IDCT is computed only for the non-zero

blocks, which are indicated by the Coded Block Pattern (CBP) in the H.263/H.263+ bit stream, approximately 70-

80% of the IDCT and the dequantization operations can be avoided. It is possible to further reduce the number of

IDCT computations by exploiting the sparseness of the remaining non-zero blocks. In our implementation, we

reduce the number of multiplications and additions necessary to compute an IDCT by detecting zero sub-blocks in

these non-zero blocks. Figure 5 shows the number of computations necessary for our adaptation of the fast IDCT

from Arai et al. [20] for different zero sub-block combinations. Depending on the input sequence and the quantizer

value, approximately 80% of the blocks conform to the structure shown in Figure 5.c. By detecting only this

structure, assuming a multiplication costs the same number of cycles as an addition, the IDCT computation time

can be reduced by approximately 40% (50% x 80%). However, there is an overhead involved in finding which of

the sub-blocks is zero. Nevertheless, this overhead can be eliminated by extracting the required information during

variable length coding of coefficients at the encoder side and variable length decoding of the bit stream at the

decoder side. Since the blocks are scanned in a zigzag order prior to VLC, it is not possible to exactly identify the

structure shown in Figure 5.c, but we can instead identify the structure shown in Figure 6. In a low bit rate video

coding application, approximately 70% of the 8x8 blocks conform to this structure. This corresponds to a saving of

35% in the IDCT computation time.
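
A hedged sketch of this idea is given below; the cut-off index and the routine names are illustrative assumptions, the point being only that the zig-zag index of the last non-zero coefficient, which is available for free during VLC encoding or decoding, determines whether the reduced IDCT of Figure 6 can be used.

    /* The first several zig-zag positions all lie in the low-frequency
     * corner of the 8x8 block, so a small "last non-zero index" implies
     * the structure of Figure 6.  The cut-off below is illustrative. */
    #define LOW_FREQ_CUTOFF 10

    extern void idct_8x8_full(short *block);
    extern void idct_8x8_low_freq(short *block);   /* reduced-cost variant */

    static void inverse_transform(short *block, int last_nonzero_zigzag)
    {
        if (last_nonzero_zigzag < 0)
            return;                                /* all-zero block       */
        else if (last_nonzero_zigzag < LOW_FREQ_CUTOFF)
            idct_8x8_low_freq(block);              /* Figure 6 structure   */
        else
            idct_8x8_full(block);
    }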

C. Quantization

Quantization does require a significant number of computations, representing approximately 8% of the encoding

time. Quantization of some of the macroblocks can be eliminated if zero block prediction prior to DCT is employed.

In this section, we employ a simple technique for faster quantization which can be used when quantization is

necessary.

In an H.263/H.263+ baseline coder, a clipping operation is performed before both quantization and

dequantization, and there are 31 predefined quantization levels. These properties make it possible to use a look-up

table for faster quantization at the expense of additional memory. The look-up table is constructed as a 2D array

and the array indices are the coefficient and quantizer values. The 2D array requires 253,952 bytes for the

quantizer and half of that size for the dequantizer if 16-bit short integers are used. Employing a look-up

table increases the quantization speed by approximately 3 to 4 times. Address mapping to the look-up table can also

be made efficient. In H.263/H.263+ coding, the quantizer value is constant within a macroblock even if rate control

is enabled. Thus, in order to use the cache more efficiently, it is better to have the coefficient values stored

sequentially in memory for each quantizer value. More specifically, representing the array as

Table[quantizer][coefficient] in C results in higher quantization speed than using a Table[coefficient][quantizer]

format, which causes many irregular memory accesses.
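
A hedged sketch of this layout is shown below; `quantize_reference` stands for the encoder's original arithmetic quantizer, and the assumed clipping range of [-2048, 2047] for the DCT coefficients is what gives the 31 x 4096 x 2-byte (253,952-byte) table mentioned above.

    #define NUM_QUANT   31
    #define COEFF_RANGE 4096                      /* clipped coefficients: -2048..2047 */

    /* Table[quantizer][coefficient]: for a constant quantizer (as within a
     * macroblock), successive look-ups hit consecutive memory locations. */
    static short quant_table[NUM_QUANT][COEFF_RANGE];

    extern short quantize_reference(int coeff, int quant);  /* original arithmetic version */

    static void init_quant_table(void)
    {
        int q, c;
        for (q = 1; q <= NUM_QUANT; q++)
            for (c = -2048; c < 2048; c++)
                quant_table[q - 1][c + 2048] = quantize_reference(c, q);
    }

    static short quantize(int coeff, int quant)   /* replaces division/rounding */
    {
        return quant_table[quant - 1][coeff + 2048];
    }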

D. Partial SAD Computation

The sum of absolute difference (SAD) computation is the most commonly used measure to determine the best

matching block during the motion estimation (ME) process. Since it is computationally very intensive, most of the

fast ME algorithms aim at decreasing substantially the number of SAD computations.

A common SAD computation reduction technique is to compare the partially accumulated SAD value to the

minimum known SAD. The computation is terminated if the partial SAD is greater than the minimum SAD. Using

this technique with full search reduces the number of absolute difference operations to approximately one fourth for

a 16x16 block. We further reduce the number of operations by predicting whether the SAD of a given block will

exceed the minimum SAD ahead of time, thus eliminating the need to perform actual computations. For example,

during the SAD computation for a 16x16 block, if after processing the second row of the block, the partially

accumulated SAD is already half of the minimum SAD, it is then highly likely that the final SAD will exceed the

minimum SAD. The underlying prediction is clearly a function of the SAD value computed up to the current row,

the number of rows processed, and the total number of rows. A fairly effective prediction model can be given by

$$P = S + \alpha\,\frac{(N-I)\,S}{I},$$

where P is the predicted SAD, S is the partially accumulated SAD, I is the number of rows processed so far, N is the dimension of the ME block, and $\alpha$ is the accuracy coefficient. The number of absolute difference computations per SAD for a 16x16 block and the change in peak signal to noise ratio (PSNR) for different values of $\alpha$ are presented in Table 3. It can be concluded from the table that selecting $\alpha$ equal to 0.5 yields a good compromise between

picture quality and computation time.
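
A sketch of a row-wise 16x16 SAD with this predictive early termination follows, assuming the reconstructed formula above with $\alpha = 0.5$; rewriting the test $P > \mathrm{SAD}_{min}$ as $S(N+I) > 2I\,\mathrm{SAD}_{min}$ keeps the inner loop in integer arithmetic. The function name and interface are illustrative.

    #include <stdlib.h>

    /* 16x16 SAD with predictive early termination.  After each row, the
     * candidate is rejected if the predicted final SAD (alpha = 0.5)
     * already exceeds the best SAD found so far. */
    static unsigned int sad_16x16_predictive(const unsigned char *cur,
                                             const unsigned char *ref,
                                             int stride, unsigned int min_sad)
    {
        const unsigned int N = 16;
        unsigned int s = 0, i, n;

        for (i = 1; i <= N; i++) {
            for (n = 0; n < N; n++)
                s += (unsigned int)abs((int)cur[n] - (int)ref[n]);
            cur += stride;
            ref += stride;
            if (s * (N + i) > 2u * i * min_sad)   /* P = S + 0.5*(N-i)*S/i > min_sad */
                return min_sad + 1;               /* predicted to lose: give up      */
        }
        return s;
    }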

The above SAD prediction is not very effective when it is used in conjunction with the fast search ME

algorithms, because, in such fast search algorithms, the minimum SAD and the partial SAD values are usually very

close. In this case, trying to predict the SAD may even increase the computation time due to the computation

overhead incurred by prediction.

E. Fast Half Pixel Motion Estimation Based on Approximations

Half pixel motion estimation is one of the main enhancements of H.263 over H.261. It provides, in most cases,

more than a 2 dB PSNR increase in picture quality. Since half pixel ME requires bilinear interpolation of the

reconstructed picture and computation of 8 SAD values for each macroblock, it increases both the encoding time

and complexity. The eight SAD computations needed for half pixel ME constitute a rather insignificant

computational load compared to the 256 SAD computations needed for full search ME. However, in comparison to

the fast search algorithm described in the H.263+ Test Model [14], which uses 10 SAD computations on average,

the number of half pixel SAD computations becomes significant.

1) A Simplified Half Pixel ME for the MPEG-2 Encoder

There have already been some efforts to reduce or eliminate the computation time for half pixel ME. Senda et al. [22] proposed a simple approximation technique to remove the need for the interpolation and the SAD computations performed for half pixel ME within an MPEG-2 coding framework. They suggest computing each half pixel SAD from the surrounding integer pixel SADs, as illustrated in Figure 7.a, by using the following approximations:

$$a = \alpha_{VH}\,(A + B + C + D + 2)/4,$$
$$b = \alpha_{H}\,(C + D + 1)/2, \quad \mathrm{and}$$
$$c = \alpha_{V}\,(B + D + 1)/2,$$

where a, b, and c are the predicted half pixel SADs that correspond to the points a, b, and c respectively, A, B, C, and D are the precomputed integer pixel SADs that correspond to the points A, B, C, and D respectively, and $\alpha_{VH}$, $\alpha_{H}$, and $\alpha_{V}$ are statistically optimized coefficients for high bit rate (>1 Mbit/sec) MPEG-2 encoding. It is possible

to adapt this technique to low bit rate video coding (<64 Kbit/sec) by re-optimizing these coefficients. Re-

optimizing is done by encoding a large variety of low bit rate video sequences and selecting the coefficients that

result in the best estimations of half pixel motion vectors. The optimized coefficients are given in Table 4.

2) Fast Search ME and Half Pixel ME via Approximation

To perform the half pixel ME approximation described in the previous section, all the surrounding integer

pixel SADs must be known. However, in many fast search ME algorithms, not all the surrounding SADs are

available. One solution is to simply compute the SADs for the missing locations. Alternatively, the approximation equations and the coefficients $\alpha_{VH}$, $\alpha_{V}$, and $\alpha_{H}$ can be modified so that only the available integer SADs are used.

The fast ME algorithm described in H.263+ Test Model [14] computes the SADs at locations on the vertices of a

diamond. The modified half pixel ME equations and the optimized $\alpha$'s for this algorithm can be written as

$$a = \alpha_{VH}\,(C + B + D + 1)/3, \qquad \alpha_{VH} = 28/32,$$
$$b = \alpha_{H}\,(C + D + 1)/2, \qquad \alpha_{H} = 29/32, \quad \mathrm{and}$$
$$c = \alpha_{V}\,(B + D + 1)/2, \qquad \alpha_{V} = 29/32.$$
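
Rewritten in integer arithmetic, these approximations reduce to a few additions and small constant multiplications and divisions; a minimal sketch follows (function and argument names are illustrative).

    /* Approximate half pixel SADs for the diamond-pattern fast ME, using
     * the available integer pixel SADs B, C and D of Figure 7(b) and the
     * optimized coefficients 28/32, 29/32 and 29/32 given above. */
    static void approx_half_pixel_sads(int B, int C, int D,
                                       int *a, int *b, int *c)
    {
        *a = 28 * (C + B + D + 1) / 96;   /* alpha_VH * (C + B + D + 1) / 3 */
        *b = 29 * (C + D + 1) / 64;       /* alpha_H  * (C + D + 1) / 2     */
        *c = 29 * (B + D + 1) / 64;       /* alpha_V  * (B + D + 1) / 2     */
    }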

Although this technique removes the need for interpolation and 8 half pixel SAD computations, the reproduction

quality may not be acceptable in some applications. We propose three methods that improve video reproduction

quality while still offering significant savings in computation time. Our motivation is essentially the same for all

three methods: When a half pixel motion vector is estimated to have the minimum associated SAD by using the

surrounding integer pixel SADs, a better estimation is likely to be found by performing a limited half pixel motion

vector search around that motion vector. The three methods are described next:

1. Method 1: The best matching block is found among the four integer pixel locations that surround the integer

motion vector. Then, block matching is performed for three half pixel motion vector candidates which are

determined by the location of the best matching block. For example, in Figure 7.b, if pixel B corresponds to the best matching block, then block matching for each of the pixels 1, 2, and 3 is performed, and so on. The

half pixel MV is determined by selecting the best matching block among these and the block that corresponds

to the center pixel.

2. Method 2: The prediction used for determining the candidate half pixel motion vectors in the first method can

be improved by using the two best matching blocks out of the four surrounding integer pixel locations.

Subsequently, two or three half pixel block matchings are performed. For instance, in Figure 7.b, if A and B

pixels correspond to the two best matching blocks, half pixel block matching for each of the pixels 1, 2, and 4

is performed. If the B and D pixels correspond to the two best matching blocks, the half pixel block matching

for the pixels 2 and 7 is performed. Then, the best matching block among these pixels and the center pixel

determines the half pixel motion vector.

3. Method 3: In this method, first, the eight half pixel SADs are computed by approximation using the four

surrounding integer pixels and the center pixel as described earlier. Then block matching is performed to find

more accurate SADs for the N pixels that correspond to the smallest approximated SADs. Finally, the best

matching of these N blocks and the block that corresponds to the center pixel are compared to determine the

half pixel motion vector.

The tradeoffs of PSNR and speed improvements for each of the above methods are given in Table 5. As can be

seen from the table, all the proposed methods perform similarly. In the first method, at the cost of less than 0.1 dB, a 266% speed improvement is achieved by performing 3 SAD computations and interpolating 3 points instead of the 8 SAD computations and 8 interpolated points required by half pixel motion estimation. The second proposed method

requires 2-3 SAD computations and interpolation of 2-3 points per macroblock, which is less than that of the first

method. However, the loss in PSNR is slightly larger. The last proposed method gives the flexibility of adjusting the

PSNR-speed tradeoff by changing the value of N. N determines the number of SAD and interpolation operations

performed per macroblock. A small N results in faster execution, at the cost of a larger quality degradation.

Figure 8 shows the change in PSNR for different N values. This optimization method is particularly useful for

video encoding systems that support various PSNR-speed profiles.

IV. MMX IMPLEMENTATION OF THE H.263/H.263+ VIDEO ENCODER

In this section, we show how to map some of the compute intensive algorithms of video coding to the Intel

MMX processor. Intel has introduced the Multimedia Accelerator (MMX) extension to their current Intel

Architecture (IA), improving the performance of communication, signal processing, and multimedia applications

[29][30]. Such applications are generally very computationally intensive and usually perform localized recurring

operations with small native data types. Taking these facts into account, Intel has developed an SIMD architecture

where one instruction performs the same operation on multiple data elements simultaneously. MMX introduces four

new data types: three packed data types (packed byte, packed word and packed doubleword) and a 64-bit

quadword, as illustrated in Figure 9. Also, the MMX extension adds 57 new instructions to Intel’s Pentium

instruction set. Since most of the multimedia applications use 16-bit data, the new instruction set is optimized

mostly for the 16-bit data types. MMX also features saturation arithmetic where in the case of an overflow or an

underflow, the operation result saturates to the maximum or the minimum value that the register can hold instead of

being wrapped and setting a carry flag. In order to retain backward compatibility, the MMX registers are mapped

onto the floating point registers, and therefore, MMX instructions can not be mixed with floating point instructions.

We next illustrate several mapping techniques for the computationally intensive SIMD structured components of

H.263/H.263+ video coding to improve speed performance. These video coding components include the DCT,

SAD, interpolation, data interleaving for half pixel motion estimation, and motion compensation functions.

A. SAD Computations for Motion Estimation

In H.263 and H.263+, SAD computations are performed on 16x16 or 8x8 blocks that have 8-bit unsigned data

elements. Since the size of each data element is 8 bits, 8 data elements can be processed concurrently using MMX,

yielding a significant speed performance improvement. Intel made available an MMX algorithm [31] that finds the

absolute difference of two unsigned values. We adopt this algorithm to compute SADs of 16x16 and 8x8 blocks.

Computing one SAD for two 16x16 blocks with the MMX instructions requires 64 memory loads, 64 packed

subtractions, 32 packed-OR, 32 unpack, and 63 addition operations. One drawback of using MMX instructions for

SAD computations is that it is inefficient in partial SAD computations. Since four 16-bit partial sums are kept in the same MMX register, unpacking these values and adding them together to compute a partial SAD would cause an

unacceptable delay. This is a disadvantage for full ME because most of the SAD computations could be terminated

during the early stages by using a partial SAD computation technique. However, when fast search is used, efficient

use of MMX is possible since performing partial SAD computation is usually not beneficial anyway.
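
A sketch of the 16x16 SAD using MMX compiler intrinsics is shown below; it illustrates the absolute-difference approach referenced in [31] (two saturating subtractions followed by an OR) but is not that code, and its instruction mix differs slightly from the operation counts quoted above.

    #include <mmintrin.h>

    /* SAD of two 16x16 blocks of unsigned 8-bit pixels.  |a - b| is formed
     * per byte as (a -sat b) | (b -sat a); the byte differences are then
     * zero-extended to 16 bits and accumulated in four packed partial sums. */
    static unsigned int sad_16x16_mmx(const unsigned char *cur,
                                      const unsigned char *ref, int stride)
    {
        __m64 zero = _mm_setzero_si64();
        __m64 acc  = _mm_setzero_si64();
        union { __m64 v; unsigned short w[4]; } u;
        int row, half;

        for (row = 0; row < 16; row++) {
            for (half = 0; half < 2; half++) {
                __m64 a = *(const __m64 *)(cur + 8 * half);
                __m64 b = *(const __m64 *)(ref + 8 * half);
                __m64 d = _mm_or_si64(_mm_subs_pu8(a, b),     /* PSUBUSB + POR */
                                      _mm_subs_pu8(b, a));
                acc = _mm_add_pi16(acc, _mm_unpacklo_pi8(d, zero));
                acc = _mm_add_pi16(acc, _mm_unpackhi_pi8(d, zero));
            }
            cur += stride;
            ref += stride;
        }
        u.v = acc;
        _mm_empty();                                           /* leave MMX state */
        return u.w[0] + u.w[1] + u.w[2] + u.w[3];
    }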

B. DCT

We have chosen to map the DCT algorithm of Arai et al. [20] onto the MMX architecture because of its small number of consecutive multiplications and its regular structure, which is suitable for implementation on an SIMD

structure. Because the MMX instructions can perform only integer arithmetic, the floating point algorithm was

modified so that only fixed-point arithmetic is employed. The input to the DCT is an array of signed 8-bit data and

the output is an array of 16-bit signed data. In the intermediate stages of the computations, 16-bit registers are used

to attain the maximum speed at the cost of a very small loss in accuracy. Small DCT errors are usually negligible,

because they do not propagate through the video frames and such loss in accuracy is insignificant as compared to

the loss caused by the coarse quantization process which is common in low bit rate video coding.

In our MMX DCT implementation, we performed the DCT on four 1x8 input vectors at a time. In this

implementation approach, the input array must be transposed before applying the DCT. After transposition, the

DCT for the upper four rows of the array is computed, followed by the computations for the remaining four rows.

After another array transposition, the DCT is applied again on four rows at a time. Operations other than

multiplications are performed on four 16-bit data simultaneously. The multiplications are performed using 32-bit

data and the corresponding results are immediately downscaled and compressed into 16 bits. As illustrated in

Figure 10, two multiplication operations, one for the low significant part and the other for the high significant part,

are performed first. Then, the results are interleaved so that the higher significant part and the lower significant part

used in the same multiplication are in the same MMX register. 32-bit additions and arithmetic right shift operations

are then executed for downscaling. Last, the four 32-bit results are packed into one 64-bit MMX register using signed

saturation.
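
This multiply-and-downscale step maps naturally to compiler intrinsics; the sketch below follows the flowgraph of Figure 10 as an illustration rather than reproducing the actual routine, and callers are assumed to issue _mm_empty() before any subsequent floating point code.

    #include <mmintrin.h>

    /* Multiply four 16-bit values y by a 16-bit coefficient a, round with
     * 2^(p-1), shift right by p and pack back to 16 bits with signed
     * saturation (PMULLW/PMULHW, PUNPCKLWD/PUNPCKHWD, PADDD, PSRAD, PACKSSDW). */
    static __m64 mul_downscale(__m64 y, short a, int p)
    {
        __m64 coef = _mm_set1_pi16(a);
        __m64 lo   = _mm_mullo_pi16(y, coef);           /* low 16 bits of products  */
        __m64 hi   = _mm_mulhi_pi16(y, coef);           /* high 16 bits of products */
        __m64 r0   = _mm_unpacklo_pi16(lo, hi);         /* two full 32-bit products */
        __m64 r1   = _mm_unpackhi_pi16(lo, hi);         /* two more 32-bit products */
        __m64 rnd  = _mm_set1_pi32(1 << (p - 1));       /* rounding constant        */

        r0 = _mm_srai_pi32(_mm_add_pi32(r0, rnd), p);
        r1 = _mm_srai_pi32(_mm_add_pi32(r1, rnd), p);
        return _mm_packs_pi32(r0, r1);                  /* signed saturation        */
    }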

The MMX DCT implementation runs approximately four times faster than the floating point implementation on

the same Pentium processor and three times faster than the optimized fixed point C implementation. Since a fixed

point DCT algorithm is used in the MMX implementation, the rate-distortion performance of the encoder could be affected as well. However, Table 6 shows that the resulting picture quality in terms of PSNR for different bit rates and sequences is almost unchanged for both DCT implementations.

C. IDCT

Since the H.263 and H.263+ standards are based on inter picture prediction, any error caused by the IDCT’s

low accuracy propagates throughout the subsequent video frames. ITU-T has an IDCT accuracy measurement

procedure within the H.263 and H.263+ standards. According to our research, it is not possible to implement an H.263/H.263+ standard compliant IDCT function using 32-bit or lower precision. Even if the IDCT is implemented

with 32-bit precision using MMX, only two data elements can be processed simultaneously, and because of the large computational overhead, a significant increase in speed cannot be expected. Intel has implemented and made

publicly available an MMX IDCT routine for MPEG decoding. Their implementation uses 32-bit precision for

multiplication and 16-bit precision for accumulation operations, and is approximately 4 times faster than the

optimized C implementation. However, it does not meet the specifications of the ITU-T standard. The ITU-T IDCT

accuracy specifications require that the DCT and the IDCT, with 64-bit floating point accuracy, be applied to a certain number of blocks that are generated by a pseudo random number generator routine, and that the peak error, mean square errors, and mean errors (pixel-wise and overall) be less than certain predefined thresholds. Also, if the

reference IDCT produces a zero output, the IDCT under test should also produce a zero output. More details on

these requirements can be found in [3]. Even if an implementation does not meet the standard, it may be beneficial

to compare its computational accuracy to such thresholds. Table 7 shows the partial (only for numbers that are in

the range of [-256, +255]) accuracy results for 32-bit floating point, 64-bit integer and 32-bit integer IDCT

implementations along with Intel’s MMX IDCT implementation.

The IDCT accuracy is more important for H.263 and H.263+ than for MPEG. In an H.263/H.263+ application, only one out of every 132 macroblocks has to be intra coded, whereas in an MPEG application a frame is typically

intra coded every 10-15 frames. Nevertheless, Intel’s IDCT implementation can still be used in our H.263/H.263+

codec implementation provided that the encoder/decoder software runs on processors having the same architecture.

However, in that case, the codec would not be truly standard compliant.

D. Interpolation

Interpolation is performed on every reconstructed I and P picture in order to perform motion estimation with half pixel accuracy. Figure 3 illustrates the bilinear interpolation performed as specified in the H.263 and H.263+

standards. Since the interpolation is performed on 16-bit data, it is possible to achieve a significant speed

improvement by processing 4 units of data simultaneously. Figure 11 shows the inner core of the interpolation

function that is implemented using MMX instructions. First, the four 1x8 pixel vectors are loaded into the MMX

registers and the data that is going to be processed is separated by using unpacking instructions. Next, packed

additions are performed and the results are written into 16-bit registers. Last, the 8-bit results are packed together

again and written back to the memory. In the outer loop, this operation is repeated for the pixels in the row below

until the bottom line of the picture is reached. Then, starting from the top, the same procedure is repeated for the

next 8 pixels in the horizontal direction. The implemented MMX interpolation algorithm is three times faster than

the optimized C implementation.
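
For illustration, the horizontal half pixel case of this unpack/add/pack pattern can be sketched with compiler intrinsics as follows (rounding control off); it covers only one of the three half pixel positions of Figure 3, is not the actual routine, and callers are assumed to issue _mm_empty() when leaving MMX code.

    #include <mmintrin.h>

    /* b = (p(x) + p(x+1) + 1) / 2 for eight adjacent pixels: unpack the
     * bytes to 16 bits, add, round, shift and pack back to 8 bits. */
    static void interp_horizontal_mmx(const unsigned char *src, unsigned char *dst)
    {
        __m64 zero = _mm_setzero_si64();
        __m64 one  = _mm_set1_pi16(1);
        __m64 a    = *(const __m64 *)src;          /* pixels x   .. x+7 */
        __m64 b    = *(const __m64 *)(src + 1);    /* pixels x+1 .. x+8 */

        __m64 lo = _mm_srli_pi16(_mm_add_pi16(_mm_add_pi16(
                       _mm_unpacklo_pi8(a, zero), _mm_unpacklo_pi8(b, zero)), one), 1);
        __m64 hi = _mm_srli_pi16(_mm_add_pi16(_mm_add_pi16(
                       _mm_unpackhi_pi8(a, zero), _mm_unpackhi_pi8(b, zero)), one), 1);

        *(__m64 *)dst = _mm_packs_pu16(lo, hi);    /* PACKUSWB back to bytes */
    }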

E. Other MMX optimizations

We have optimized several other computationally intensive components of our implementation and achieved

additional speed improvements. These components include data interleaving for half pixel ME, motion compensation, which consists of simple addition operations on reconstructed residual blocks, and memory copy operations.

V. EXPERIMENTAL RESULTS

In this section, we summarize the speed performance improvements of individual platform independent

algorithms and MMX implementations for each compute intensive module of the software. We also present the

overall H.263/H.263+ baseline video encoding system improvements. The speed performance improvement is indicated by the speedup factor, defined as $E_{UOP}/E_{OP}$, where $E_{UOP}$ is the execution time of the unoptimized module and $E_{OP}$ is the execution time of the optimized module.

A. Test Sets and Conditions

In our experiments we used a wide variety of test sequences, mostly in QCIF (176x144) resolution. Here,

however, we present performance results for two sequences that, we believe, represent approximately the two ends

of the possible input video sequence spectrum. The first one is Foreman, a high motion video sequence, and the

second one is Akiyo, a low motion news sequence. Additionally, the performance results obtained by combining all

the optimizations are presented for Coastguard and Container video sequences as well. All of these sequences are in

QCIF resolution and consist of 300 frames. The simulations presented here are performed by skipping two frames

for every encoded frame. Therefore, the resulting frame rate is 10 frames per second and the total number of

encoded frames is 100. Each simulation was repeated several times and the numerical results were averaged.

B. Speed Performance Improvements of the Proposed Algorithms

Table 8 summarizes the speed performance improvement achieved for each module using our proposed

platform independent efficient algorithms. In these simulations, the Akiyo sequence is encoded at 8 kbps, and the

Foreman sequence is encoded at 28 kbps on a Pentium 200 MHz computer. A rate control scheme described in

TMN-11 [14] is employed. As can be seen from Table 8, the change in compression performance in terms of PSNR

is negligible in all cases. Using zero block prediction prior to DCT, it is possible to skip approximately 60-70% of

the DCT and quantization operations at low bit rates. This corresponds to computational time savings of

approximately 60% in our fixed point C implementation where the DCT computation time is relatively large

compared to the computation time of predicting zero blocks, and approximately 15% in the MMX implementation.

Moreover, using our proposed fast half pixel motion estimation technique, great computational gains for both half

pixel ME and interpolation are achieved.

C. Speed Performance Improvements of MMX optimizations

The MMX performance improvement of each video coding module implementation is presented in Table 9. In

the table, the integer pixel ME module includes the SAD computations and memory copy functions. The half pixel

ME module includes the SAD computations and the data interleaving functions. The simulations are performed on a

Pentium 200 MHz computer with MMX support. The number of data elements that are processed simultaneously

in one MMX register (N) is also given in Table 9. There is no compression performance degradation going from the

C implementation to an MMX implementation unless conversion from floating point arithmetic to fixed point

arithmetic is required. In the DCT case, the PSNR difference is in the (-0.02, +0.02) range, which is negligible. The

MMX implementation of the SAD function runs approximately 4-5 times faster than the optimized C

implementation, while maintaining the same rate-distortion performance. A 4:1 speed improvement is achieved in

the data interleaving operations of half pixel ME by processing 8 data in one instruction. Our MMX

implementation of motion compensation, which is performed on 32 bit data, runs 1.5 times faster and memory copy

operations run approximately 1.6 times faster as compared to the optimized C implementations. Note that the

simulation results presented above reflect only MMX optimization algorithms. Unlike the platform independent

algorithms, the MMX mapping algorithms yield speed improvements that do not depend on the sequence or the

coding bit rate.

D. Overall Performance Improvements

Table 10 summarizes the speed performance improvements when all of the platform independent algorithmic

optimizations are enabled. The partial SAD computation technique is not employed when the fast ME is used. The

table shows the number of encoded frames per second for the unoptimized code ($F_{UOP}$) and for the optimized code ($F_{OP}$). The simulations are run both on Pentium 200 MHz and Sun Sparc Ultra 2 168 MHz processors. Rate

control is not employed here. Instead, a constant quantizer (Q) of 18 is used for all frames. The algorithmically

optimized code runs approximately 1.6 times faster than the unoptimized coder. The speed improvement would be

larger at very low bit rates.

When all of the MMX optimization algorithms are employed, approximately a 100% performance improvement

is achieved as shown in Table 11. This might seem unexpectedly low for the full ME case since 75% of all

operations correspond to SAD computations and the MMX-optimized SAD implementation is 4 to 5 times faster

than the C implementation. However, this can be justified in light of the discussion of Section IV.A, which indicates

that the partial SAD computation technique cannot be implemented efficiently using MMX instructions. Therefore,

all the SAD computations are performed until the last row is processed, whereas in the C implementation, SAD

computation is usually terminated after processing the first several rows.

Finally, Table 12 summarizes the overall speed performance improvements when all algorithmic and MMX

optimizations are employed. The resulting encoder can encode more than 15 frames per second (fps) on a Pentium

MMX 200 MHz processor where the unoptimized version of our encoder can only encode 6-7 fps on the same

processor. The change in picture quality as the result of our speed optimizations is very small as shown in Table

13. Moreover, even when some of the low complexity H.263 or H.263+ modes are enabled, such as the PB-frames

and modified quantization modes, it is still possible to achieve a similarly high encoding rate. Finally, note that the

speed advantage of our proposed algorithms would have been significantly higher were the control components of

our public-domain encoder efficiently implemented.

VI. CONCLUSIONS

In this paper, we have proposed platform independent efficient coding and MMX mapping algorithms that

increase substantially the speed of low bit rate (<64 Kbps) video encoding. Our algorithms are implemented using

our public-domain H.263/H.263+ video coding software [1]. The resulting H.263/H.263+ baseline encoder can

encode more than 15 fps on a Pentium MMX 200 MHz computer while maintaining a high video reproduction

quality.

REFERENCES

[1] Signal Processing & Multimedia Group, University of British Columbia, “TMN H.263+ encoder/decoder, version 3.0”,

September 1997. http://spmg.ece.ubc.ca.

[2] ITU Telecom. Standardization Sector of ITU, “Video coding for low bit rate communication”, ITU-T Recommendation H.263,

March 1996.

[3] ITU Telecom. Standardization Sector of ITU, “Video coding for low bit rate communication”, ITU-T Recommendation H.263

Version 2, January 1998.

[4] Matsushita Electric, http://eweb.mei.co.jp/

[5] C-Cube Microsystems, http://www.c-cube.com/products/encoders.html

[6] IBM MPEGE422, http://www.chips.ibm.com/products/mpeg/encoderkit.html

[7] VIS of Sun Ultra Sparc, http://www.sun.com/microelectronics/vis/

[8] Hewlett-Packard PA-RISC, http://www.hp.com/

[9] Intel’s MMX Technology Programmers Reference Manual, http://developer.intel.com/drg/mmx/manuals/prm/prm.htm

[10] V. Bhaskaran and K. Konstantinides, Image and video compression standards, Kluwer Academic Publishers, 1995.

[11] K.R. Rao and J.J. Hwang, Techniques & standards for image video & audio coding, Prentice-Hall, 1996.

[12] B. Girod, E. Steinbach, and N. Färber, "Performance of the H.263 video compression standard," Journal of VLSI Signal

Processing, no. 17, pp. 101 - 111, 1997.

[13] G. Côté, B. Erol, M. Gallant, and F. Kossentini, “H.263+: Video coding at low bit rates”, IEEE Transactions on Circuits and

Systems for Video Technology, pp. 849-866, November 1998.

[14] ITU Telecom. Standardization Sector of ITU, “Video codec Test Model Near-term, Version 11 (TMN 11)”, Q15G16, February

1999.

[15] M. Gallant, G. Côté, and F. Kossentini, “A computation constrained block-based motion estimation algorithm for low bit rate

video coding”, IEEE Transactions on Image Processing, pp. 1816-1823, December 1999.

[16] K. Lengwehasatit, A. Ortega, A. Basso, and A. Reibman, “A novel computationally scalable algorithm for motion estimation”,

in Proceedings of VCIP'98, San Jose CA, Jan 1998.

[17] V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards: Algorithms and Architecture, Kluwer Academic

Publishers, Boston, 1995.

[18] J. Jain and A. Jain, “Displacement measurement and its application in interframe image coding," IEEE Trans. on

Communications, vol. COM-29, pp. 1799-1808, Dec. 1981.

[19] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, “Motion-compensated interframe coding for video conferencing,” in Proc. NTC 81, vol. 28, (New Orleans), pp. 239-251, Dec. 1981.

[20] Y. Arai, T. Agui, and M. Nakajima, “A fast DCT-SQ scheme for images”, Transactions of IEICE, pp. 1095-1097, 1988.

[21] C. Loeffler, A. Ligtenberg, and G. Moschytz, “Practical fast 1-D DCT algorithms with 11 multiplications”, in Proceedings of

ICASSP’89, pp. 988-991, 1989.

[22] Y. Senda, H. Harasaki, and M. Yamo, “A simplified motion estimation using an approximation for the MPEG-2 real-time

encoder”, IEEE, pp. 2273-2276, 1995.

[23] K. Froitzheim and H. Wolf, “A knowledge-based approach to JPEG acceleration”, in Proceedings of IS&T/SPIE Symposium on

Electrical Imaging Science and Technology, February 1995.

[24] K. Lengwehasatit and A. Ortega, “DCT computation with minimal average number of operations”, Proc. of Visual

Communications and Image Processing, pp. 71-82 February 1997.

[25] S. McCanne, M. Vetterli, and V. Jacobson, “Low-complexity video coding for receiver-driven layered multicast”, IEEE Journal

on Selected Areas in Communications, vol. 16, no. 6, pp. 983-1001, August 1997.

[26] B. Girod and K. W. Stuhlmuller, “A content-dependent fast DCT for low bit rate video coding”, In Proceedings of ICIP'98, vol

3, pp. 80-84, Chicago IL, Oct 1998.

[27] A. Yu, R. Lee, and M. Flynn, "Performance enhancement of H.263 encoder based on zero coefficient prediction," in Proceedings

of the Fifth ACM International Multimedia Conference, Seattle, Washington, pp. 21-29, November, 1997.

[28] K. Lengwehasatit and A. Ortega, “DCT computation based on variable complexity fast approximations”, in Proceedings of

ICIP'98, vol 3, pp. 95-99, Chicago IL, Oct 1998.

[29] A. Peleg and U. Weiser, “MMX technology extension to the Intel architecture”, IEEE Micro, pp. 42-50, August 1996.

[30] A. Peleg, S. Wilkie, and U. Weiser, “Intel MMX for multimedia PCs”, Communications of the ACM, pp. 25-38, January 1997.

[31] Intel’s MMX application notes, http://developer.intel.com/drg/mmx/appnotes.

List of Figures

Figure 1. Block diagram of the H.263/H.263+ baseline encoder.

Figure 2. Structure of a macroblock.

Figure 3. Bilinear prediction in half pixel motion estimation.

Figure 4. Zero block statistics for different predictors (Foreman, Q=13, 10 fps).

Figure 5. Number of computations of a fast IDCT for different zero sub-block patterns.

Figure 6. Zero sub-block detection via bit stream parsing.

Figure 7. (a) Half pixel motion vector prediction using 8 precomputed integer pixel SADs. (b) Half and integer pixel locations in a diamond shape area.

Figure 8. PSNR performance of Method 3 as a function of N (Foreman sequence).

Figure 9. MMX architecture data types.

Figure 10. Flowgraph of multiplication and downscaling operations with MMX instructions.

Figure 11. The inner core of the MMX implementation of interpolation.

Page 23: Efficient Coding and Mapping Algorithms For Software-only ... · Here, we give a brief overview to the H.263/H.263+ baseline encoding process. H.263 and H.263+ standards, like the

23

DCT

MC

Integerpixel ME

Half pixelME

IDCT

IQ

Coding Control

VLCQ

0

-+

+

+

Video in

Inter/Intra

Video signal

Control Signals

Reconstruction (Decoder)

Figure 1. Block diagram of the H.263/H.263+ baseline encoder.

[Figure: a macroblock consisting of the four luminance blocks Y1-Y4 and the two chrominance blocks C1 and C2.]

Figure 2. Structure of a macroblock.

[Figure: the half pixel positions a, b, c, and d surrounded by the integer pixels p11, p12, p21, and p22, where

  a = p11
  b = (p11 + p12 + 1 - rtype) / 2
  c = (p11 + p21 + 1 - rtype) / 2
  d = (p11 + p12 + p21 + p22 + 2 - rtype) / 4

and rtype is the rounding type: when the mode is on, rtype is 1 for every other picture and 0 for the rest.]

Figure 3. Bilinear prediction in half pixel motion estimation.
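
The relations in Figure 3 reduce to shift-and-add arithmetic in software. The following C fragment is a minimal sketch of that computation (the function name and interface are ours, not the encoder's routine); it assumes non-negative 8-bit pixel values and rtype equal to 0 or 1.

/* Bilinear half pixel prediction of Figure 3.  p11, p12, p21, p22 are the
 * four surrounding integer pixels; rtype is the rounding type bit (0 or 1).
 * The divisions by 2 and 4 become shifts since all operands are non-negative. */
static void half_pixel_predict(int p11, int p12, int p21, int p22,
                               int rtype, int out[4])
{
    out[0] = p11;                                       /* a */
    out[1] = (p11 + p12 + 1 - rtype) >> 1;              /* b */
    out[2] = (p11 + p21 + 1 - rtype) >> 1;              /* c */
    out[3] = (p11 + p12 + p21 + p22 + 2 - rtype) >> 2;  /* d */
}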


[Figure: four histograms, F1-F4, one per predictor, plotting the number of blocks against the predictor value (x axis up to 2000 for F1 and F3, up to 20000 for F2 and F4), each separating zero blocks from non-zero blocks.]

Figure 4. Zero block statistics for different predictors. (Foreman, Q=13, 10 fps).

[Figure: four zero/non-zero sub-block patterns, (a)-(d), of an 8x8 block; depending on the pattern, the fast IDCT requires 56 multiplications and 252 additions, 92 multiplications and 284 additions, 128 multiplications and 432 additions, or 144 multiplications and 464 additions.]

Figure 5. Number of computations of a fast IDCT for different sub zero block patterns.
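
In software, the savings of Figure 5 come from classifying each 8x8 block by its zero 4x4 sub-blocks and dispatching to a correspondingly pruned IDCT. The C sketch below only illustrates such a dispatch; the idct_8x8_* routines are hypothetical placeholders for the pruned transforms, not the paper's implementation.

/* Hypothetical pruned IDCT variants, each exploiting a different zero
 * sub-block pattern (placeholder names, for illustration only).        */
void idct_8x8_full(short *block);
void idct_8x8_low4x4_only(short *block);  /* only the top-left 4x4 is non-zero */
void idct_8x8_skip_high(short *block);    /* the bottom-right 4x4 is zero      */

/* Bit i of zero_mask is set when 4x4 sub-block i is entirely zero
 * (0: top-left, 1: top-right, 2: bottom-left, 3: bottom-right).        */
void idct_8x8_pruned(short *block, int zero_mask)
{
    if (zero_mask == 0xF)
        return;                        /* all-zero block: nothing to compute */
    if ((zero_mask & 0xE) == 0xE)
        idct_8x8_low4x4_only(block);   /* cheapest pattern of Figure 5       */
    else if (zero_mask & 0x8)
        idct_8x8_skip_high(block);
    else
        idct_8x8_full(block);          /* fall back to the full fast IDCT    */
}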


[Figure: a block in which only one sub-block is non-zero, as detected by parsing the bit stream; for this pattern the fast IDCT requires 50 multiplications and 240 additions.]

Figure 6. Zero sub-block detection via bit stream parsing.

[Figure: (a) integer pixels A, B, C, and D with the half pixel positions a, b, and c; (b) integer pixels A-E and the eight half pixel positions labelled 1-8 forming a diamond shape area.]

Figure 7. (a) Half pixel motion vector prediction using 8 precomputed integer pixel SADs.

(b) Half and integer pixel locations in a diamond shape area.

[Figure: "PSNR Reduction for Different N Values" - reduction in PSNR (dB, from -0.15 to 0.05) versus quantizer (0 to 35) for N = 1, 2, 3, and 6, with the high bit rate region at small quantizers and the low bit rate region at large quantizers.]

Figure 8. PSNR performance of Method 3 as a function of N (Foreman sequence).


[Figure: the four 64-bit MMX data types - packed byte (eight 8-bit elements), packed word (four 16-bit elements), packed doubleword (two 32-bit elements), and quadword (one 64-bit element).]

Figure 9. MMX architecture data types.

[Figure: flowgraph of Z = Downscale(Y x A, P), where Y is an intermediate variable, A is the coefficient to multiply, and P is the downscaling precision. PMULLW and PMULHW produce the low and high halves of the products Y0xA through Y3xA, PUNPCKLWD and PUNPCKHWD interleave them into 32-bit values, PADDD adds the rounding constant 2^(P-1), PSRAD shifts right by P, and PACKSSDW packs the four results Z0-Z3 back into 16-bit words.]

Figure 10. Flowgraph of multiplication and downscaling operations with MMX instructions.
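
For reference, the same data flow can be written with the MMX C intrinsics instead of assembly. The fragment below is our own sketch of the Figure 10 pipeline (four signed 16-bit values multiplied by a common coefficient A, rounded by 2^(P-1), and shifted right by P), not code taken from the encoder; callers must still issue _mm_empty() (EMMS) before returning to floating point code.

#include <mmintrin.h>

/* Z[i] = (Y[i]*A + 2^(P-1)) >> P for four signed 16-bit values, following
 * the PMULLW/PMULHW -> PUNPCKLWD/PUNPCKHWD -> PADDD -> PSRAD -> PACKSSDW
 * flow of Figure 10.                                                      */
static __m64 mul_downscale4(__m64 y, short a, int p)
{
    __m64 coef  = _mm_set1_pi16(a);
    __m64 lo    = _mm_mullo_pi16(y, coef);     /* low 16 bits of Y[i]*A  */
    __m64 hi    = _mm_mulhi_pi16(y, coef);     /* high 16 bits of Y[i]*A */
    __m64 prod0 = _mm_unpacklo_pi16(lo, hi);   /* 32-bit products 0, 1   */
    __m64 prod1 = _mm_unpackhi_pi16(lo, hi);   /* 32-bit products 2, 3   */
    __m64 round = _mm_set1_pi32(1 << (p - 1)); /* rounding constant      */
    prod0 = _mm_srai_pi32(_mm_add_pi32(prod0, round), p);
    prod1 = _mm_srai_pi32(_mm_add_pi32(prod1, round), p);
    return _mm_packs_pi32(prod0, prod1);       /* four 16-bit Z[i]       */
}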


[Figure: the inner loop unpacks the bytes of two adjacent rows (p11..p19 and p21..p29) and their one-pixel-shifted copies into 16-bit words with PUNPCKLBW, then uses PADDW to add adjacent pixels plus the rounding constant, producing the quantities 2b+rtype, 2c+rtype, and 4d+rtype (and the a values) for four half pixel positions in parallel.]

Figure 11. The inner core of MMX implementation of interpolation.
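
The byte-unpacking and averaging steps of Figure 11 can likewise be expressed with intrinsics. The sketch below covers only the horizontal case, b = (p + p_right + 1 - rtype)/2, for four pixels at a time, with an interface of our choosing; a full routine would also form the c and d values and pack the results back to bytes with PACKUSWB.

#include <mmintrin.h>

/* Horizontal half pixel values b[i] = (row[i] + row[i+1] + 1 - rtype) >> 1
 * for four pixels, using the PUNPCKLBW / PADDW steps of Figure 11.         */
static __m64 interp_h4(const unsigned char *row, int rtype)
{
    __m64 zero  = _mm_setzero_si64();
    __m64 left  = _mm_cvtsi32_si64(*(const int *)row);        /* p11..p14 */
    __m64 right = _mm_cvtsi32_si64(*(const int *)(row + 1));  /* p12..p15 */
    __m64 l16   = _mm_unpacklo_pi8(left, zero);   /* widen bytes to words  */
    __m64 r16   = _mm_unpacklo_pi8(right, zero);
    __m64 one   = _mm_set1_pi16((short)(1 - rtype));
    __m64 sum   = _mm_add_pi16(_mm_add_pi16(l16, r16), one);
    return _mm_srli_pi16(sum, 1);                 /* four 16-bit b values  */
}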


List of Tables

Table 1.  Relative computational costs of the H.263/H.263+ baseline encoder functions ............... 29
Table 2.  , , and thresholds for different quantizers ............................................... 29
Table 3.  Change in picture quality and number of absolute difference (AD) operations for different
          values .................................................................................... 29
Table 4.  The change in PSNR using different approximation factors ( ) .............................. 29
Table 5.  PSNR table for different fast half pixel ME methods ....................................... 30
Table 6.  Performance degradation caused by using an integer DCT implementation ..................... 30
Table 7.  ITU-T IDCT accuracy test results for various IDCT implementations ......................... 30
Table 8.  The speed improvements achieved in each module by using the platform independent techniques 30
Table 9.  MMX implementation speed performance improvements for each module ......................... 31
Table 10. Encoding frame rates and speed improvements for all the platform independent optimizations  31
Table 11. Encoding frame rates and speed improvements for all MMX optimizations ..................... 31
Table 12. Encoding frame rates and speed performance improvements when all the platform independent
          and MMX optimizations are combined (QP=18, no rate control) ............................... 31
Table 13. Change in PSNR when all the platform independent and MMX optimizations are combined (QP=18,
          no rate control) .......................................................................... 32


Function                 Computational Load
Fast IDCT                9%
Integer Pixel ME         19%
Fast DCT                 12%
Half Pixel ME            9%
Quantization             8%
Interpolation            5%
Motion Compensation      2%

Table 1. Relative computational costs of the H.263/H.263+ baseline encoder functions.

QP        (apprx.)    (apprx.)    threshold
< 9       Too low to consider     -
9-11      0.4         0.75        128
12-14     0.5         0.8         192
15-19     0.55        0.85        256
20-21     0.6         0.9         320
22-25     0.65        0.9         384
26-29     0.7         0.9         448
30-31     0.7         0.9         512

Table 2. , , and thresholds for different quantizers.

Sequence                                 w/o SAD prediction   = 0.2    = 0.5    = 0.7    = 1
Foreman (48 kbit/sec)   PSNR (dB)              30.74          -0.04    -0.07    -0.3     -0.95
                        Number of ADs             89             61       40       33       27
Akiyo (24 kbit/sec)     PSNR (dB)              37.17          +0.03      0.0    -0.06    -0.13
                        Number of ADs             54             33       24       22       20

Table 3. Change in picture quality and number of absolute difference (AD) operations for different values.

                                     Foreman        Akiyo         Foreman        Akiyo
                                     48 Kb/sec      8 Kb/sec      64 Kb/sec      28 Kb/sec
                                     (Q>13)         (Q>13)        (Q<13)         (Q<13)
PSNR with half pixel ME              30.47          32.63         31.56          37.92
PSNR with VH=27/32, V=H=29/32        -0.13          -0.21         -0.21          -0.47

Table 4. The change in PSNR using different approximation factors ( ).


                                        Foreman      Foreman      Akiyo        Akiyo
                                        64 Kb/sec    48 Kb/sec    28 Kb/sec    8 Kb/sec
with half pixel ME                      31.56        30.47        37.92        32.63
no half pixel ME                        -2.24        -2.16        -1.61        -1.33
Approximation using 5 integer pixels
  (VH=28/32, V=H=29/32)                 -0.36        -0.31        -0.52        -0.21
Method 1                                -0.07        -0.05        -0.04         0
Method 2                                -0.12        -0.11         0           -0.02
Method 3 (N=4 pixels)                   -0.04        -0.02        +0.02        +0.06
Method 3 (N=2 pixels)                   -0.15        -0.14        -0.05        -0.02

Table 5. PSNR table for different fast half pixel ME methods.

                                        Foreman      Akiyo       Foreman      Akiyo
                                        48 Kb/sec    8 Kb/sec    64 Kb/sec    28 Kb/sec
PSNR with floating point DCT            30.47        32.63       31.56        37.92
Change in PSNR with fixed point DCT     -0.01        +0.02       -0.01         0.00

Table 6. Performance degradation caused by using an integer DCT implementation.

                                   Peak     Zero output       Maximum mean           Overall
                                   error    rule violation    error for any pixel    mean error
Thresholds                         1        No                0.015                  0.0015
32-bit floating point fast IDCT    1        No                0.0002                 0.0000156
64-bit integer IDCT                1        No                0.0001                 0.00000625
32-bit integer IDCT                1        Yes               0.0539                 0.0289
Intel's MMX IDCT                   3        Yes               1.866                  0.55

Table 7. ITU-T IDCT accuracy test results for various IDCT implementations.

                                                   Akiyo (8 kbps)             Foreman (28 kbps)
Module                                             Speed-up   Change in PSNR  Speed-up   Change in PSNR
DCT + Quantization                                 2.2         0              1.6        -0.01
IDCT                                               2.7         0              2.3         0
Quantization                                       3.4         0              3.4         0
Half Pixel ME + Interpolation (Method 3, N=3)      2.8        +0.02           2.9        -0.02
Integer Pixel ME (full search, =0.5)               2.3        -0.05           1.4        -0.07

Table 8. The speed improvements achieved in each module by using the platform independent techniques.


Module                           Speed-up    N
DCT (floating point-fast)        3.8         4
DCT (integer-fast)               2.7         4
Integer Pixel ME (fast search)   2.9         -
Half Pixel ME                    3.1         -
Interpolation                    3.0         4
Motion Compensation              1.5         2
SAD                              4.4         8
Data Interleaving                3.8         8
Memory Copy                      1.6         8

Table 9. MMX implementation speed performance improvements for each module.

                   Pentium 200 MHz                                Sun Sparc Ultra 2 168 MHz
                   Foreman (QP=18)       Akiyo (QP=18)            Foreman (QP=18)       Akiyo (QP=18)
H.263 encoding     FUOP  FOP  Speed-up   FUOP  FOP   Speed-up     FUOP  FOP   Speed-up  FUOP  FOP   Speed-up
Fast ME            6.4   9.4  1.5        7.5   11.6  1.6          7.4   10.9  1.5       8.7   14.3  1.6
Full ME            2.1   3.0  1.4        2.4   4.1   1.7          2.2   3.0   1.4       2.7   4.0   1.5

Table 10. Encoding frame rates and speed improvements for all the platform independent optimizations.

                   Pentium 200 MHz
                   Foreman (QP=18)        Akiyo (QP=18)
H.263 encoding     FUOP  FOP   Speed-up   FUOP  FOP   Speed-up
Fast ME            6.4   11.5  1.8        7.5   13.5  1.8
Full ME            2.0   3.9   2.0        2.4   4.0   1.7

Table 11. Encoding frame rates and speed improvements for all MMX optimizations.

                   Pentium 200 MHz
                   Foreman (35 kbps)      Akiyo (6 kbps)         Coastguard (35 kbps)   Container (10 kbps)
H.263 encoding     FUOP  FOP   Speed-up   FUOP  FOP  Speed-up    FUOP  FOP  Speed-up    FUOP  FOP   Speed-up
Fast ME            6.4   14.3  2.2        7.5   17   2.3         5.8   12   2.1         6.6   15.1  2.3
Full ME            2.0   4.1   2.1        2.4   4.4  1.8         1.5   3.6  2.4         1.8   3.8   2.1

Table 12. Encoding frame rates and speed performance improvements when all the platform independent and
MMX optimizations are combined (QP=18, no rate control).


                                         Pentium 200 MHz
                                         Foreman (35 kbps)   Akiyo (6 kbps)   Coastguard (35 kbps)   Container (10 kbps)
PSNR - before the speed optimizations    31.13               32.58            31.67                  31.28
PSNR - after the speed optimizations     31.10               32.58            31.69                  31.33
Change in PSNR                           -0.03               0.00             +0.02                  +0.05

Table 13. Change in PSNR when all the platform independent and MMX optimizations are combined (QP=18,
no rate control).