Optimization of H.264 High Profile Decoder for Pentium 4 Processor
Tarun Bhatia
University of Texas at Arlington
H.264/AVC video coding introduces substantially more coding tools and coding options than earlier standards. Achieving the highest possible coding gain therefore requires much more computation.
Aggressive optimization is typically required for H.264 implementations to meet cost and power targets and to provide real-time performance for applications.
Sequences Used
Plane.264, Shore.264, Golf.264, Girl.264, Karate.264
H.264 Profiles
[Figure: diagram of the coding tools covered by the Baseline, Main, Extended and High profiles. Tools shown: I slice, P slice, CAVLC, Arbitrary Slice Order (ASO), Flexible Macroblock Ordering (FMO), Redundant Slices, B slices, Weighted Prediction, CABAC, Data Partition, SP Slice, SI Slice, Adaptive Block Size Transform, Perceptual Quantization Matrices]
H.264 High Profiles - features
Main Profile + additional features
  8x8 integer DCT
  HVS matrices
  8x8 intra prediction modes
Optimization: Levels
Algorithm level
  e.g. DCT implementation
Compiler level
  Microsoft Visual Studio .NET 2003 / Intel C++ compiler v 8.0
Implementation level
  e.g. elimination of loops and conditions
  Using SIMD for implementation
  Multithreading
Purpose: Simultaneous Execution of Threads
[Figure: processor block diagram with the labels Architectural State (x2), Local APIC (x2), Execution Engine, Bus Interface and System Bus]
Optimization: Steps
Optimization during code development
Optimization after code development
  1) Searching for "hotspots" in the code
  2) Analysis of each hotspot, e.g. a large number of calls, cache misses, or a slow implementation
  3) Optimization of the hotspots
    movd [ECX+4], mm4   // result[5:8] = mm4
    // Repeat the same process for fb[9:16] and bb[9:16]
    emms                // Empty MMX state
  }
}
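Only the tail of the MMX routine survives above, so the following is a separate sketch rather than the author's code. It assumes that fb and bb are forward and backward prediction blocks combined by a rounded average (the default bi-prediction case), and it uses SSE2 compiler intrinsics instead of inline MMX assembly. The function names are hypothetical. When SSE2 is not available at compile time, the SIMD function falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Scalar reference: rounded average of two prediction blocks. */
static void average_blocks_scalar(const uint8_t *fb, const uint8_t *bb,
                                  uint8_t *result, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        result[i] = (uint8_t)((fb[i] + bb[i] + 1) >> 1);
}

/* SIMD version: processes 16 pixels per iteration when SSE2 is available,
   then finishes any remaining pixels with the scalar loop. */
static void average_blocks_simd(const uint8_t *fb, const uint8_t *bb,
                                uint8_t *result, size_t n)
{
    assert(fb != NULL && bb != NULL && result != NULL);
    size_t i = 0;
#if HAVE_SSE2
    for (; i + 16 <= n; i += 16) {
        __m128i f = _mm_loadu_si128((const __m128i *)(fb + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(bb + i));
        /* _mm_avg_epu8 computes (f + b + 1) >> 1 per unsigned byte. */
        _mm_storeu_si128((__m128i *)(result + i), _mm_avg_epu8(f, b));
    }
#endif
    average_blocks_scalar(fb + i, bb + i, result + i, n - i);
}
```

Unlike MMX code, the SSE2 integer instructions work on the XMM registers, so no emms instruction is needed afterwards.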
SIMD Application Results
Amdahl's Law: the overall speedup (O.S.) obtained by optimizing a portion p of the program by a factor s is

\[ \text{O.S.} = \left( \frac{1}{1 - p + (p/s)} - 1 \right) \times 100\,\% \]

p: fraction of the code being optimized
s: speedup factor for that fraction of code
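As a worked illustration of the formula, the small C helper below evaluates the overall speedup for given p and s. The function name and the example numbers are hypothetical and are not measurements from this work.

```c
#include <assert.h>

/* Overall speedup in percent when a fraction p of the run time
   (0 <= p <= 1) is made s times faster (s > 0), per Amdahl's Law. */
static double overall_speedup_percent(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return (1.0 / (1.0 - p + p / s) - 1.0) * 100.0;
}
```

For example, if a module takes 30 % of the decoding time (p = 0.30) and is made twice as fast (s = 2), the formula gives 1 / 0.85 - 1, or about 17.6 %. Even for arbitrarily large s the gain is bounded by p / (1 - p), about 42.9 % in this example, which is why the hotspots with the largest p are optimized first.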
Half-pel interpolation
Quarter-pel interpolation
Linear interpolation for B frames
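For orientation, here is a minimal scalar sketch of the first two operations for luma samples, assuming the standard H.264 6-tap filter (1, -5, 20, 20, -5, 1) for one-dimensional half-sample positions and a rounded average for quarter-sample positions. The function names are hypothetical, and the centre half-sample position, which is filtered from unrounded intermediate values, is not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-sample value between src[0] and src[1] along one row or column.
   The caller must guarantee that src[-2] .. src[3] are readable. */
static uint8_t half_pel(const uint8_t *src)
{
    assert(src != NULL);
    int v = src[-2] - 5 * src[-1] + 20 * src[0]
          + 20 * src[1] - 5 * src[2] + src[3] + 16;   /* +16 for rounding */
    if (v < 0)
        return 0;                                     /* clip low */
    v >>= 5;                                          /* divide by 32 */
    return (uint8_t)(v > 255 ? 255 : v);              /* clip high */
}

/* Quarter-sample value: rounded average of the two nearest
   integer-sample or half-sample values. */
static uint8_t quarter_pel(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}
```

Both functions are regular multiply-add or averaging kernels applied to every pixel of a block, which makes them natural candidates for SIMD.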
Motion Compensation: % Time Consumption (without MMX)
[Figure: bar chart of the % of time consumed by Manipulation and Interpolation for the Girl, Golf, Karate, Plane and Shore sequences; vertical axis from 0 to 35]
SIMD Application to Motion Compensation - Results
Motion Compensation - Results: Comparison of % Time Consumed
[Figure: bar chart of the % of time consumed with No SIMD and with SIMD for the Girl, Golf, Karate, Plane and Shore sequences; vertical axis from 0 to 35]
% Overall Speedup in Decoding Time with SIMD MC
[Figure: bar chart of the overall speedup for the Girl, Golf, Karate, Plane and Shore sequences; vertical axis from 0 to 16]
Multithreading
Definition: Multithreading is the ability of a program to multitask within itself. The program can split itself into separate "threads" of execution that appear to run concurrently.
Waits are used to block a thread until a particular event hands over control.
Release is used to unblock the thread.
Semaphores: locking mechanisms / counters that control access to shared resources used by multiple processes.
Producer-Consumer Problem (Diagram)
[Figure: a Producer Thread and a Consumer Thread coordinated through Semaphores with Wait and Release operations; one label reads "Serial Execution of a Thread"]
Producer-Consumer Problem (Algorithm)
  1) The producer thread starts and initializes data.
  2) It waits for the consumer thread.
  3) If the consumer thread is ready, it releases control to the consumer thread.
  4) The producer thread completes one execution cycle in the meantime and waits for the consumer thread.
  5) When control is passed back to the producer thread, the process is repeated until the end condition is met.
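The steps above can be sketched with two counting semaphores. This is an illustration under stated assumptions, not the decoder's actual code: it uses POSIX threads and semaphores (sem_wait as the Wait, sem_post as the Release) rather than the Win32 API, and all names, the single shared slot and the item count used as the end condition are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>
#include <semaphore.h>

#define ITEM_COUNT 5            /* hypothetical end condition */

static sem_t slot_free;         /* posted when the consumer has taken the item */
static sem_t item_ready;        /* posted when the producer has written an item */
static int shared_item;         /* the shared resource: a single slot */
static int consumed_sum;

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 1; i <= ITEM_COUNT; ++i) {
        sem_wait(&slot_free);   /* Wait: block until the consumer is ready */
        shared_item = i;        /* one execution cycle of the producer */
        sem_post(&item_ready);  /* Release: hand control to the consumer */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITEM_COUNT; ++i) {
        sem_wait(&item_ready);  /* Wait: block until an item is available */
        consumed_sum += shared_item;
        sem_post(&slot_free);   /* Release: pass control back to the producer */
    }
    return NULL;
}

/* Runs both threads to completion and returns the sum seen by the consumer. */
static int run_producer_consumer(void)
{
    pthread_t p, c;
    int rc;

    consumed_sum = 0;
    rc = sem_init(&slot_free, 0, 1);    /* slot starts empty, so producer may run */
    assert(rc == 0);
    rc = sem_init(&item_ready, 0, 0);   /* nothing to consume yet */
    assert(rc == 0);
    rc = pthread_create(&p, NULL, producer, NULL);
    assert(rc == 0);
    rc = pthread_create(&c, NULL, consumer, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&slot_free);
    sem_destroy(&item_ready);
    return consumed_sum;
}
```

The two semaphores force strict alternation on the shared slot, so no item is overwritten before it is read. On most toolchains the sketch needs to be built with the -pthread option.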
Multithreading in Video Coding
The codec can be multithreaded in two ways:
Block level
  Independent blocks can be executed as separate threads, e.g. slices in H.264, motion estimation, deblocking of non-reference frames
GOP level
  Closed GOP: group of frames that uses no reference frames except those from its own GOP
  Open GOP: group of frames that can use reference frames from outside its own GOP
GOP Level (Closed GOP)
30 frames per GOP: IPPPPPPP…P
Each GOP begins with an I frame and contains P frames only (i.e. 1 I frame and 29 P frames in each GOP)
B frames are not used in the design, to maintain the closed GOP structure
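Because every closed GOP is self-contained, whole GOPs can be handed to different decoder instances. The sketch below shows the index arithmetic for the 30-frame structure described above. The round-robin assignment of GOPs to decoders is an assumption made for illustration, not a scheme stated in the slides, and all names are hypothetical.

```c
#include <assert.h>

#define FRAMES_PER_GOP 30   /* from the slide: 1 I frame + 29 P frames */

/* Zero-based index of the closed GOP that contains a zero-based frame index. */
static int gop_of_frame(int frame)
{
    assert(frame >= 0);
    return frame / FRAMES_PER_GOP;
}

/* 'I' for the first frame of each GOP, 'P' for the other 29 frames. */
static char frame_type(int frame)
{
    assert(frame >= 0);
    return (frame % FRAMES_PER_GOP == 0) ? 'I' : 'P';
}

/* Hypothetical round-robin mapping of a GOP to one of num_decoders decoders. */
static int decoder_for_gop(int gop, int num_decoders)
{
    assert(gop >= 0 && num_decoders > 0);
    return gop % num_decoders;
}
```

With three decoders, for instance, frame 95 lies in GOP 3 and would be handled by decoder 0 under this mapping.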
Multithreaded Decoder - Threads
Main thread
  Creates all threads and semaphores
  Gets SPS and PPS NALUs from the bitstream
  Initializes multiple decoders with SPS and PPS NALUs
Get IDR Frame Position thread
  Searches for the IDR NALU position in the bitstream
  Manages waits and releases of semaphores
SPS: Sequence Parameter Set
PPS: Picture Parameter Set
NALU: Network Abstraction Layer Unit
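The search for SPS, PPS and IDR NALUs can be sketched as a scan for start codes. The example assumes the input is an H.264 Annex B byte stream, in which each NAL unit follows a 00 00 01 start code prefix and the low five bits of the first NAL byte give nal_unit_type (5 for an IDR slice, 7 for SPS, 8 for PPS). The function and constant names are hypothetical and this is not the decoder's actual routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NAL_IDR_SLICE = 5, NAL_SPS = 7, NAL_PPS = 8 };

/* Returns the offset of the header byte of the first NAL unit of type
   nal_type whose start code begins at or after `start`, or -1 if none. */
static long find_nalu(const uint8_t *buf, size_t len, size_t start, int nal_type)
{
    assert(buf != NULL);
    for (size_t i = start; i + 3 < len; ++i) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1 &&
            (buf[i + 3] & 0x1F) == nal_type)
            return (long)(i + 3);
    }
    return -1;
}
```

Calling find_nalu repeatedly with NAL_IDR_SLICE and an increasing start offset yields the position of every IDR picture, and each position marks the start of a closed GOP that can be passed to a separate decoder.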
Multithreading - Results: % Speedup in Decoding Time
[Figure: bar chart of the % speedup in decoding time for the Girl, Golf, Karate, Plane and Shore sequences with 2, 3 and 4 threads; vertical axis from 0 to 16]
Multithreading - Results: Threading Overhead (Time in seconds)
[Figure: bar chart of the threading overhead in seconds for the Girl, Golf, Karate, Plane and Shore sequences with 2, 3 and 4 threads; vertical axis from 0 to 0.25]
Further Research
Optimization of a High Profile HD (720p) encoder to minimize hardware requirements
Testing of the H.264 encoder and decoder on multicore CPUs
Implementation of time-consuming modules of the H.264 encoder and decoder on a GPU (Graphics Processing Unit)
References
H.264: International Telecommunication Union, "Recommendation ITU-T H.264: Advanced Video Coding for Generic Audiovisual Services," ITU-T, 2005.
MPEG-2: ISO/IEC JTC1/SC29/WG11 and ITU-T, "ISO/IEC 13818-2: Information Technology-Generic Coding of Moving Pictures and Associated Audio Information: Video," ISO/IEC and ITU-T, 1994.
Soon-kak Kwon, A. Tamhankar and K.R. Rao, "Overview of MPEG-4 Part 10".
G. Sullivan, P. Topiwala and A. Luthra, "The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions," SPIE Conference on Applications of Digital Image Processing XXVII, vol. 5558, pp. 53-74, Aug 2004.
The Software Optimization Cookbook, Intel Press, 2002.
IA-32 Intel Architecture Optimization, Reference Manual, www.intel.com
Optimization Applications with the Intel C++ and FORTRAN compilers, White paper,
Seoul National University. http://sips03.snu.ac.kr/pub/conf/c67.pdf Accepted at IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS), December 2004.
Amdahl, G.M., "Validity of the single-processor approach to achieving large scale computing capabilities," in AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20), AFIPS Press, Reston, Va., 1967, pp. 483-485.
Horowitz, A. Joch, F. Kossentini, and A. Hallapuro, "H.264/AVC Baseline Profile Decoder Complexity Analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 704-716, July 2003.