COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING
By Sudeep Gangavati, Department of Electrical Engineering, University of Texas at Arlington
Supervisor: Dr. K. R. Rao




Outline
- Introduction to video compression
- Why H.264
- Overview of H.264
- Motivation
- Possible approaches
- Related work
- Theoretical estimation
- Proposed approach
- Parallel computing
- NVIDIA GPUs and the CUDA programming model
- Complexity reduction using CUDA
- Results
- Conclusions and future work

Introduction to video compression
- Video codec: software or a hardware device that can compress and decompress video.
- Need for compression: limited bandwidth and limited storage space.
- Several codecs exist: H.264, VP8, AVS China, Dirac, etc.

Figure 1: Forecast of mobile data usage

Cisco predicts a tenfold increase in mobile data traffic over the next couple of years, and nearly 70% of that traffic will be mobile video. Because both bandwidth and storage are limited, video must be compressed. Several codecs are in use, such as H.264 from ITU-T, VP8 and Dirac.

Why H.264?
- H.264/MPEG-4 Part 10, or AVC (Advanced Video Coding), was standardized by ITU-T VCEG and MPEG in 2004.
- Approximately 50% bit-rate reduction over MPEG-2.
- The most widely used standard.
- Built on the concepts of earlier standards such as MPEG-2.
- Substantial compression efficiency.
- Network-friendly data representation.
- Improved error-resiliency tools.
- Supports a wide range of applications.

Among the many video coding standards, we use H.264 in particular because of its roughly 50% bit-rate reduction, better error-resiliency tools, network-friendly data representation and low learning curve.

Overview of H.264
There are two parts:
- Encoder: carries out intra prediction, motion estimation, transform, quantization and entropy encoding to produce an H.264 bit-stream.

- Decoder: carries out entropy decoding, inverse transform and inverse quantization to reconstruct the previously encoded video.

H.264 encoder [1]

The very first frame is coded as an intra frame and the subsequent frames are coded as inter frames. A residual is formed after prediction, which is then transformed, quantized and entropy coded. The encoder has two paths. The forward path consists of prediction, transform, quantization and entropy coding. The reverse path consists of inverse transform, inverse quantization, the deblocking filter and the decoded picture buffer; it exists because prediction must use the same reconstructed references the decoder will have.

H.264 decoder [2]

The decoder performs the inverse operations of the encoder, starting with the entropy decoder.

Intra prediction
- Exploits spatial redundancies.
- 9 directional modes for prediction of 4 x 4 luma blocks.
- 4 modes for 16 x 16 luma blocks.
- 4 modes for 8 x 8 chroma blocks.

Intra prediction: 9 modes for 4 x 4 luma blocks

4 modes for 16 x 16 luma blocks

Inter prediction
- Exploits temporal redundancy.
- Involves prediction from one or more previously coded frames, called reference frames.

Motion estimation and compensation
- Motion estimation and compensation is the process of finding a matching block in a reference frame.
- A motion search is performed over a search window.
- The resulting motion vectors give the displacement of the block.

Transform, quantization and encoding
- The prediction residuals are then transformed.
- H.264 employs an integer transform, essentially a rough approximation of the DCT.
- After the transform, the coefficients are quantized for compression.
- Entropy encoding: CAVLC / CABAC.

H.264 profiles [1]
H.264 provides several profiles for different applications.
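As an illustration of the integer transform mentioned above, the 4 x 4 forward core transform of H.264 can be written as Y = C X C^T with small integer coefficients. This is a standalone sketch for clarity, not code from the thesis implementation:

```cpp
#include <array>

// 4 x 4 forward core transform used in H.264: Y = C * X * C^T.
// All arithmetic is exact integer arithmetic (no floating point),
// which is why the transform is cheap and drift-free.
using Mat4 = std::array<std::array<int, 4>, 4>;

Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

Mat4 forward_transform(const Mat4& x) {
    const Mat4 c  = {{{1, 1, 1, 1}, {2, 1, -1, -2}, {1, -1, -1, 1}, {1, -2, 2, -1}}};
    const Mat4 ct = {{{1, 2, 1, 1}, {1, 1, -1, -2}, {1, -1, -1, 2}, {1, -2, 1, -1}}};
    return multiply(multiply(c, x), ct); // Y = C * X * C^T
}
```

For a flat residual block the energy compacts into the single DC coefficient, which is exactly what quantization then exploits.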

Motivation
A time profiling [45] of the H.264 encoder was performed and showed:

- Motion estimation takes more time than any other module in H.264.
- This time needs to be reduced through an efficient implementation, without sacrificing video quality or bitrate.
- Reducing motion estimation time reduces the total encoding time.

So far we have seen the different operations in H.264 such as intra prediction and motion estimation. Time profiling shows that about 90% of the encoding time is taken by motion estimation, which dominates all other operations. Many-core processors have evolved significantly, enabling us to leverage their processing power to speed up such computational problems. This thesis focuses on using many-core processors to reduce the time taken by motion estimation.

Possible approaches for complexity reduction
Encoder optimization levels:
- Algorithmic level: develop new algorithms, similar to the three-step search or fast mode decision algorithms.
- Compiler level: efficient programming.
- Implementation level: parallel programming using CUDA or OpenMP, utilizing the underlying hardware.

Related work

1. Chan et al. [41]
   Features: considers a pyramid algorithm for motion estimation; uses the predicted motion vector to calculate the SAD.
   Disadvantages: video quality degradation; RD performance is not considered.

2. Lee et al. [40]
   Features: multi-pass motion estimation; generates local and global SADs in the first and second passes; a fast motion-search algorithm is used.
   Advantages: 6 times speed-up compared to the standard implementation.
   Disadvantages: focuses only on speed, not on rate and distortion; threads are invoked per pixel; video resolution limits thread creation.

3. Rodriguez et al. [42]
   Features: considers a tree-structured motion estimation algorithm with three sequential steps: SAD calculation, a binary reduction algorithm, and cost reduction.
   Disadvantages: implementation results in a higher bitrate; RD performance is not shown.

4. Cheng et al. [44]
   Features: based on a simplified unsymmetrical multi-hexagon search; divides the frame into tiles, with a thread created for each tile.
   Advantages: 3x speed-up.
   Disadvantages: penalty in video quality.

5. NVIDIA encoder
   Features: the search algorithm is not disclosed; no documentation on internal details.
   Advantages: provides about 4 times speed-up; very good visual quality.
   Disadvantages: fixed search range.

Issues with previous work
- Focus only on achievable speed-up.
- Methods to decrease the bitrate are not considered.
- Techniques to maintain video quality are not considered.
- Thread-creation overhead and limitations in some approaches.

Theoretical estimation by Amdahl's law [43]
We use this law to find the maximum achievable speed-up.

Widely used in parallel computing to predict theoretical maximum speed up using multiple processors.

Amdahl's law states that if P is the proportion of a program that can be parallelized and (1 - P) is the proportion that cannot be parallelized, then the maximum speed-up that can be achieved using N processors is

    S(N) = 1 / ((1 - P) + P / N)

Using Amdahl's law
- Approximates the speed-up achieved upon parallelizing a portion of the code.
- P: parallelized portion.
- N: number of processor cores.

In the encoder code, motion estimation accounts for approximately two thirds of the code.

Applying the law, the maximum speed-up that can be achieved in our case is 2.2 times, i.e. a 55% reduction in encoding time.

Proposed work
We propose the following to address the problem:
- Use CUDA for a parallel implementation for faster calculation of the SAD (sum of absolute differences), with one thread per block instead of one thread per pixel, to address the thread-creation overhead and limitation.
- Use a better search algorithm for motion estimation to maintain video quality.
- Combine SAD cost values to save bitrate.

These methods address all the issues mentioned earlier.

In addition, we utilize the shared and texture memory of the GPU, which reduces global memory references and provides faster memory access.

Parallel computing
- Multi-core and many-core processors improve efficiency through parallel processing.
- Parallel processing provides significant improvement.
- Techniques to program software on multi-core processors: data parallelism and task parallelism.

Since many-core processors work in parallel, let us look at the different techniques available for parallel computing. A given problem has to be decomposed in such a way that it can be processed in parallel. There are mainly two techniques: data parallelism and task parallelism.

Parallel computing: data parallelism
Split the large data set into smaller parts and execute them in parallel; after execution, the data are regrouped.

Parallel computing: task parallelism
- Distribute threads to different processors.
- The data could be common.
- Threads may execute the same or different code.

NVIDIA GPU and the CUDA programming model
NVIDIA pioneered graphics processing unit (GPU) technology. Its first GPU, the GeForce 256 released in 1999, had 128 MB of graphics memory.

GPUs, consisting of many processing cores, are used in applications requiring large amounts of computation.

CPU-GPU Heterogeneous Model

Host-device connection

Compute Unified Device Architecture (CUDA) [22]
- NVIDIA introduced CUDA in 2006 as a programming model that makes programs run on the GPU.
- The serial portions of the program are written as C/C++ functions.
- The parallel portions are written as GPU kernels.
- The C/C++ functions execute on the CPU; kernels are sent to the GPU for processing.

Problem decomposition
- Serial C functions run on the CPU.
- CUDA kernels run on the GPU.

Hardware architecture
- Main element: the streaming multiprocessor (SM).
- The GT 550M has 2 SMs.
- Each SM has 48 cores.
- Each SM is capable of executing 1536 threads.
- In total, 3072 threads can run in parallel.

Threading
- Threads are grouped into blocks.
- Blocks are grouped into grids.
- All threads within a block execute on the same SM.

Complexity reduction using CUDA
Motion estimation: the process of finding the best-matching block.

We have seen the basic architecture of CUDA and its threading hierarchy; let us come back to motion estimation.

Complexity reduction using CUDA
- To find the best-matching block, a search is done in the search window (or region).
- The search finds the best-matching block by computing the difference between blocks, i.e. the sum of absolute differences (SAD).

    SAD(dx, dy) = sum over i, j of | C(i, j) - R(i + dx, j + dy) |

where C is the current block, R is the reference frame, and (dx, dy) is the candidate displacement; the motion vector is the (dx, dy) that minimizes the SAD.

- Search through a search range of 8, 16 or at most 32.
- Select the block with the least SAD.
- The larger the block size, the more computations are required.
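A minimal CPU reference for the SAD computation described above may help make the kernel's work concrete. This is an illustrative sketch, not the thesis code; the 8 x 8 block size matches the sub-blocks used later, and the row-major frame layout is an assumption:

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// SAD between an 8 x 8 block of the current frame at (cx, cy) and a
// candidate block of the reference frame displaced by (dx, dy).
// Frames are stored row-major with the given width in pixels.
// Bounds checking is left to the caller.
int sad8x8(const std::vector<uint8_t>& cur, const std::vector<uint8_t>& ref,
           int width, int cx, int cy, int dx, int dy) {
    int sad = 0;
    for (int j = 0; j < 8; ++j)
        for (int i = 0; i < 8; ++i)
            sad += std::abs(int(cur[(cy + j) * width + (cx + i)]) -
                            int(ref[(cy + dy + j) * width + (cx + dx + i)]));
    return sad;
}
```

A full search then simply loops (dx, dy) over the search range and keeps the displacement with the minimum SAD; in the GPU version each such block-level computation is handed to one thread.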

Complexity reduction using CUDA

A 352 x 288 frame, standard algorithm:
- Divide the frame into macroblocks of size 16 x 16.
- Further divide these macroblocks into sub-blocks of 8 x 8.
- Search through the search area.
- Compute SADs and obtain motion vectors.

Our approach
The main ideas are to:
- Minimize memory references and memory transfers.
- Make use of shared memory and texture memory.
- Use a single thread to compute the SAD for a single block.
- Make thread-block creation dependent on the frame size, for scalability.
A large number of threads are invoked in parallel; each thread block consists of 396 threads, which compute the SADs of 396 8 x 8 blocks.

SAD mapping to threads

For 352 x 288: (352 / 8) * (288 / 8) = 1584 blocks whose SADs are to be computed.
Total thread blocks = 4, each with 396 threads.
This makes the approach scalable: for a higher-resolution video such as 704 x 480 (4SIF) or 704 x 576 (4CIF), we can create 16 blocks of 396 threads each. The number of threads created thus depends on the video resolution.
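The mapping arithmetic above can be sketched as follows (a standalone illustration; the 396 threads-per-block figure is the value chosen in this work, and the round-up is an assumption for resolutions that do not divide evenly):

```cpp
// Number of 8 x 8 sub-blocks in a frame, and the number of CUDA thread
// blocks needed when each thread computes the SAD of one sub-block and
// each thread block holds 396 threads. Illustrative sketch.
constexpr int kThreadsPerBlock = 396;

int subBlocks(int width, int height) {
    return (width / 8) * (height / 8);
}

int threadBlocks(int width, int height) {
    // Round up so every sub-block is covered by some thread.
    return (subBlocks(width, height) + kThreadsPerBlock - 1) / kThreadsPerBlock;
}
```

For CIF (352 x 288) this yields 1584 sub-blocks and 4 thread blocks, matching the numbers above; for 4CIF (704 x 576) it yields 6336 sub-blocks and 16 thread blocks.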

Performance enhancements
We consider the rate-distortion (RD) criteria and employ the following techniques:
- To minimize bitrate: calculate the cost for the smaller 8 x 8 sub-blocks and combine four of them to form a single cost for the 16 x 16 block.
- To enhance video quality: incorporate the exhaustive full-search algorithm, which calculates the matching block over the entire search area without skipping any candidate blocks, as opposed to faster algorithms. Previous studies [30] show that this algorithm provides the best performance. Though it is highly computational, it is used with video quality in mind.
- Memory access: data are moved from texture memory into shared memory.
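The bitrate-oriented cost combination can be sketched as follows. This is an illustration of the idea, not the thesis code; the row-major layout of the per-sub-block SAD array is an assumption:

```cpp
#include <vector>

// Combine the SAD costs of the four 8 x 8 sub-blocks inside one 16 x 16
// macroblock into a single cost. 'sad8' holds one SAD per 8 x 8 block in
// row-major order for a frame 'w' pixels wide; (mbx, mby) indexes the
// macroblock. Illustrative sketch.
int combine16x16(const std::vector<int>& sad8, int w, int mbx, int mby) {
    int bw = w / 8;                 // number of 8 x 8 blocks per row
    int bx = mbx * 2, by = mby * 2; // top-left 8 x 8 block of the macroblock
    return sad8[by * bw + bx]       + sad8[by * bw + bx + 1] +
           sad8[(by + 1) * bw + bx] + sad8[(by + 1) * bw + bx + 1];
}
```

Signaling one combined cost per 16 x 16 block rather than four separate ones is what lets the encoder choose the larger partition and spend fewer bits on motion information.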

The Memcpy API moves data into the array we allocated:

    cudaMemcpyToArray(a_before_dilated,                 // array pointer
                      0,                                // array offset width
                      0,                                // array offset height
                      h_before_dilated,                 // source
                      width * height * sizeof(uchar1),  // size in bytes
                      cudaMemcpyHostToDevice);          // type of memcpy

Texture memory
Shared memory

Performance metrics
QCIF and CIF formats

Test Sequences

Results
The CPU-GPU implementation of the encoder performs better than the CPU-only encoder, but falls short of the NVIDIA encoder. This is because the NVIDIA encoder is heavily optimized at all levels of H.264, not just motion estimation; NVIDIA has also not released the type of search algorithm it uses, and the choice of motion-search algorithm significantly affects quality, bitrate and speed. The theoretical speed-up was about 2.2-2.5 times; from the results, we achieve approximately a 2 times speed-up. The gap can be attributed to factors not considered in the estimate: the time taken by load and store operations, transfer of control to the GPU, memory transfers and references, and other H.264 computations.

Results for QCIF video sequences
- PSNR vs. bitrate for the Akiyo sequence
- PSNR vs. bitrate for the Carphone sequence
- PSNR vs. bitrate for the Container sequence
- PSNR vs. bitrate for the Foreman sequence

Results
SSIM measures the structural similarity between the input and output videos. It ranges from 0.0 to 1.0, where 0 is the lowest quality and 1 the highest.

Results
Similar behavior is observed for the CIF video sequences.

Results for CIF video sequences

Results
The SSIM values for our optimized software and the NVIDIA encoder are very close.

Conclusions
- Nearly 50% reduction in encoding time across various sequences, close to the theoretical estimate.
- Little degradation in video quality is observed.
- A lower bitrate is obtained by uniquely combining the SAD costs of sub-blocks into the SAD cost of the larger macroblock.
- SSIM, bitrate and PSNR are close to the values obtained without optimizations.
- Data parallelism is achieved.
- With little modification to the code, the approach scales to better hardware and higher video resolutions.

Limitations
As the threads work in parallel, in the case when the sum of SADs up to the kth row (k