Overview of the H.264/AVC Video Coding Standard
Thomas Wiegand, Gary J. Sullivan,
Gisle Bjøntegaard, and Ajay Luthra
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 7, JULY 2003
Goals of the H.264/AVC
Video Coding Experts Group (VCEG), ITU-T SG16 Q.6
H.26L project (early 1998)
Target: double the coding efficiency in comparison to any other existing video coding standard for a broad variety of applications.
H.261, H.262 (MPEG-2), H.263 (H.263+, H.263++)
Scope of video coding standardization
[Figure: Source -> Pre-Processing -> Encoding -> Decoding -> Post-Processing & Error Recovery -> Destination, with a "Scope of Standard" box marking the standardized part of the chain]
Applications of the H.264/AVC standard
Broadcast over cable, satellite, cable modem, DSL, terrestrial, etc.
Interactive or serial storage on optical and magnetic devices, DVD, etc.
Conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile networks, modems, etc. or mixtures of these.
Video-on-demand or multimedia streaming services over ISDN, cable modem, DSL, LAN, wireless networks, etc.
Multimedia messaging services (MMS) over ISDN, DSL, Ethernet, LAN, wireless and mobile networks, etc.
Structure of H.264/AVC video encoder
[Figure: layered encoder structure. Video Coding Layer (with Control Data) -> Coded Macroblock -> Data Partitioning -> Coded Slice/Partition -> Network Abstraction Layer -> transport (H.320, MP4FF, H.323/IP, MPEG-2, etc.)]
Design feature highlights (1): improved prediction methods
Variable block-size motion compensation with small block sizes: a minimum luma motion compensation block size as small as 4×4.
Quarter-sample-accurate motion compensation: first found in an advanced profile of the MPEG-4 Visual (Part 2) standard, but H.264/AVC further reduces the complexity of the interpolation processing compared to the prior design.
Design feature highlights (2): improved prediction methods
Motion vectors over picture boundaries: first found as an optional feature in H.263, this is included in H.264/AVC.
Multiple reference picture motion compensation
Decoupling of referencing order from display order: (X)IBBPBBPBBP… => IPBBPBBPBB… Bounded by a total memory capacity imposed to ensure decoding ability. Enables removing the extra delay previously associated with bi-predictive coding.
Design feature highlights (3): improved prediction methods
Decoupling of picture representation methods from picture referencing capability: in prior standards, B-frames could not be used as references for prediction. Removing this restriction allows referencing the closest pictures.
Weighted prediction: a new innovation in H.264/AVC allows the motion-compensated prediction signal to be weighted and offset by amounts specified by the encoder. Useful for scene fading.
Design feature highlights (4): improved prediction methods
Improved “skipped” and “direct” motion inference: motion is inferred in “skipped” areas (useful for global motion), and an enhanced motion inference method is used for “direct” areas.
Design feature highlights (5): improved prediction methods
Directional spatial prediction for intra coding: allows prediction from neighboring areas that were not coded using intra coding, something not enabled when using the transform-domain prediction method found in H.263+ and MPEG-4 Visual.
Design feature highlights (6): improved prediction methods
In-the-loop deblocking filtering: building further on a concept from an optional feature of H.263+, the deblocking filter in the H.264/AVC design is brought within the motion-compensated prediction loop.
Design feature highlights (7): other parts
Small block-size transform: the new H.264/AVC design is based primarily on a 4×4 transform. This allows the encoder to represent signals in a more locally-adaptive fashion, which reduces artifacts known colloquially as “ringing”.
Design feature highlights (8): other parts
Hierarchical block transform: a hierarchical transform extends the effective block size used for low-frequency chroma information to an 8×8 array.
The encoder can also select a special coding type for intra coding, enabling extension of the length of the luma transform for low-frequency information to a 16×16 block size.
Design feature highlights (9): other parts
Short word-length transform: while previous designs have generally required 32-bit processing, the H.264/AVC design requires only 16-bit arithmetic.
Exact-match inverse transform: building on a path laid out as an optional feature in the H.263++ effort, H.264/AVC is the first standard to achieve exact equality of decoded video content from all decoders.
Integer transform
Design feature highlights (10): other parts
Arithmetic entropy coding: while arithmetic coding was previously found as an optional feature of H.263, a more effective use of this technique is found in H.264/AVC to create a very powerful entropy coding method known as CABAC (context-adaptive binary arithmetic coding).
Design feature highlights (11): other parts
Context-adaptive entropy coding
CAVLC (context-adaptive variable-length coding)
CABAC (context-adaptive binary arithmetic coding)
Design feature highlights (12): Robustness to data errors/losses and flexibility for operation over a variety of network environments
Parameter set structure: the parameter set design provides for robust and efficient conveyance of header information.
NAL unit syntax structure: each syntax structure in H.264/AVC is placed into a logical data packet called a NAL unit.
Design feature highlights (13): Robustness to data errors/losses and flexibility for operation over a variety of network environments
Flexible slice size: unlike the rigid slice structure found in MPEG-2 (which reduces coding efficiency by increasing the quantity of header data and decreasing the effectiveness of prediction), slice sizes in H.264/AVC are highly flexible, as was the case earlier in MPEG-1.
Design feature highlights (14): Robustness to data errors/losses and flexibility for operation over a variety of network environments
Flexible macroblock ordering (FMO): significantly enhances robustness to data losses by managing the spatial relationship between the regions that are coded in each slice.
Arbitrary slice ordering (ASO): the slices of a picture can be sent and received in any order relative to each other. First found in an optional part of H.263+, this can improve end-to-end delay in real-time applications, particularly when used on networks having out-of-order delivery behavior.
Design feature highlights (15): Robustness to data errors/losses and flexibility for operation over a variety of network environments
Redundant pictures: a new ability to allow an encoder to send redundant representations of regions of pictures, enhancing robustness to data loss.
Design feature highlights (15): Robustness to data errors/losses and flexibility for operation over a variety of network environments
Data partitioning: allows the syntax of each slice to be separated into up to three different partitions for transmission, depending on a categorization of syntax elements.
This part of the design builds further on a path taken in MPEG-4 Visual and in an optional part of H.263++.
The design is simplified by having a single syntax with partitioning of that same syntax controlled by a specified categorization of syntax elements.
Design feature highlights (16): Robustness to data errors/losses and flexibility for operation over a variety of network environments
SP/SI synchronization/switching pictures: a new feature consisting of picture types that allow exact synchronization of the decoding process of some decoders with an ongoing video stream produced by other decoders without penalizing all decoders with the loss of efficiency resulting from sending an I picture.
These picture types enable switching a decoder between different data rates, recovery from data losses or errors, and trick modes such as fast-forward and fast-reverse.
NAL (Network Abstraction Layer)
[Figure: layered encoder structure. Video Coding Layer (with Control Data) -> Coded Macroblock -> Data Partitioning -> Coded Slice/Partition -> Network Abstraction Layer -> transport (H.320, MP4FF, H.323/IP, MPEG-2, etc.)]
NAL (Network Abstraction Layer)
Designed in order to provide “network friendliness”
Facilitates the ability to map H.264/AVC VCL data to transport layers such as:
RTP/IP for any kind of real-time wire-line and wireless Internet services (conversational and streaming);
File formats, e.g., ISO MP4 for storage and MMS;
H.32X for wireline and wireless conversational services;
MPEG-2 systems for broadcasting services, etc.
Key concepts of NAL
NAL units
Byte stream and packet format uses of NAL units
Parameter sets
Access units
NAL units
[Figure: a NAL unit, consisting of a 1-byte header followed by the payload]
Each NAL unit contains an integer number of bytes.
The payload is interleaved as necessary with emulation prevention bytes, which are bytes inserted with a specific value to prevent a particular pattern of data called a start code prefix from being accidentally generated inside the payload.
The NAL unit structure definition specifies a generic format for use in both packet-oriented and bitstream-oriented transport systems, and a series of NAL units generated by an encoder is referred to as a NAL unit stream.
NAL units in byte-stream format use
E.g., H.320 and MPEG-2/H.222.0 systems require delivery of the entire or partial NAL unit stream as an ordered stream of bytes or bits.
Each NAL unit is prefixed by a specific pattern of three bytes called a start code prefix.
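The two byte-stream mechanisms described above (emulation prevention bytes inside the payload, and a three-byte start code prefix in front of each NAL unit) can be sketched as follows. This is a simplified illustration, not the normative procedure: the slides do not give the byte values, so the prefix 0x000001 and the emulation prevention byte 0x03 are stated here as assumptions taken from common descriptions of the standard, and all function names are hypothetical.

```python
START_CODE_PREFIX = b"\x00\x00\x01"  # assumed three-byte start code prefix


def add_emulation_prevention(payload: bytes) -> bytes:
    """Insert 0x03 after two zero bytes whenever the next byte is <= 0x03."""
    out = bytearray()
    zeros = 0  # consecutive 0x00 bytes written so far
    for byte in payload:
        if zeros >= 2 and byte <= 0x03:
            out.append(0x03)  # emulation prevention byte (assumed value)
            zeros = 0
        out.append(byte)
        zeros = zeros + 1 if byte == 0x00 else 0
    return bytes(out)


def to_byte_stream(nal_units):
    """Prefix each already-escaped NAL unit with the start code prefix."""
    return b"".join(START_CODE_PREFIX + nal for nal in nal_units)


def split_byte_stream(stream: bytes):
    """Recover the NAL units by splitting on the start code prefix."""
    return [chunk for chunk in stream.split(START_CODE_PREFIX) if chunk]


escaped = add_emulation_prevention(b"\x65\x00\x00\x01\x88")
stream = to_byte_stream([escaped, b"\x41\x9a"])
print(escaped.hex(), split_byte_stream(stream) == [escaped, b"\x41\x9a"])
```

Because the escaped payload can no longer contain the prefix pattern, a receiver can locate NAL unit boundaries by pattern search alone.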
NAL units in packet-transport system use
E.g., Internet protocol/RTP systems.
The inclusion of start code prefixes in the data would be a waste of data carrying capacity, so instead the NAL units can be carried in data packets without start code prefixes.
VCL and non-VCL NAL units
VCL NAL units: the data that represents the values of the samples in the video pictures.
Non-VCL NAL units: any associated additional information such as parameter sets (important header data that can apply to a large number of VCL NAL units) and supplemental enhancement information (timing information and other supplemental data that may enhance usability of the decoded video signal but are not necessary for decoding the values of the samples in the video pictures).
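As a rough illustration of how a receiver might tell VCL from non-VCL NAL units, the sketch below unpacks the 1-byte NAL unit header. The field layout (1-bit forbidden_zero_bit, 2-bit nal_ref_idc, 5-bit nal_unit_type) and the convention that types 1 to 5 are VCL are not given on the slides; they are assumptions taken from the H.264/AVC syntax, and the function name is hypothetical.

```python
def parse_nal_header(first_byte: int) -> dict:
    """Split the 1-byte NAL unit header into its assumed fields."""
    nal_unit_type = first_byte & 0x1F
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x1,
        "nal_ref_idc": (first_byte >> 5) & 0x3,
        "nal_unit_type": nal_unit_type,
        # Assumption: types 1..5 carry coded slice data (VCL);
        # everything else (SEI, parameter sets, delimiters) is non-VCL.
        "is_vcl": 1 <= nal_unit_type <= 5,
    }


for byte in (0x65, 0x67, 0x68, 0x06):
    print(hex(byte), parse_nal_header(byte))
```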
Parameter Sets (1)
A parameter set contains information that is expected to change rarely and that applies to the decoding of a large number of VCL NAL units.
Parameter Sets (2)
Two types of parameter sets:
Sequence parameter sets: apply to a series of consecutive coded video pictures called a coded video sequence;
Picture parameter sets: apply to the decoding of one or more individual pictures within a coded video sequence.
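Parameter Sets (3): The Structure
[Figure: each VCL NAL unit carries an identifier referring to a picture parameter set, and each picture parameter set carries an identifier referring to a sequence parameter set. The parameter sets themselves are carried in non-VCL NAL units.]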
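Parameter Sets (4): Transmission
[Figure: in-band transmission, where non-VCL NAL units share the channel with the VCL NAL units, versus out-of-band transmission, where non-VCL NAL units travel separately from the VCL NAL units]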
Parameter set use with reliable “out-of-band” parameter set exchange
[Figure: an H.264/AVC encoder and an H.264/AVC decoder each hold parameter sets #1, #2, and #3, kept in sync by a reliable parameter set exchange. The encoder sends a NAL unit with VCL data encoded with PS#3 (address in slice header). Parameter Set #3: video format PAL, entropy code CABAC, …]
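The referencing chain described in the parameter set slides (slice -> picture parameter set -> sequence parameter set) can be sketched as a pair of lookup tables on the decoder side. This is only an illustrative model: the class names, field names, and the two example fields are hypothetical and far smaller than the real syntax structures.

```python
from dataclasses import dataclass


@dataclass
class SequenceParameterSet:  # hypothetical, heavily reduced
    sps_id: int
    video_format: str


@dataclass
class PictureParameterSet:  # hypothetical, heavily reduced
    pps_id: int
    sps_id: int  # identifier of the sequence parameter set it refers to
    entropy_coding: str


class ParameterSetStore:
    """Decoder-side tables, filled in-band or out-of-band."""

    def __init__(self):
        self.sps = {}
        self.pps = {}

    def receive(self, ps):
        if isinstance(ps, SequenceParameterSet):
            self.sps[ps.sps_id] = ps
        else:
            self.pps[ps.pps_id] = ps

    def activate(self, pps_id):
        """Resolve the identifier found in a slice header."""
        pps = self.pps[pps_id]
        return pps, self.sps[pps.sps_id]


store = ParameterSetStore()
store.receive(SequenceParameterSet(sps_id=0, video_format="PAL"))
store.receive(PictureParameterSet(pps_id=3, sps_id=0, entropy_coding="CABAC"))
pps, sps = store.activate(3)  # slice header says "PS#3"
print(sps.video_format, pps.entropy_coding)
```

Since a slice only carries a small identifier, the parameter sets can be delivered once over a reliable channel and reused by many VCL NAL units.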
Access Units
A set of NAL units in a specified form is referred to as an access unit.
[Figure: structure of an access unit, from start to end: access unit delimiter, SEI (supplemental enhancement information), primary coded picture (VCL NAL units: slices or slice data partitions), redundant coded picture, end of sequence, end of stream]
Coded Video Sequences
A coded video sequence consists of a series of access units that are sequential in the NAL unit stream and use only one sequence parameter set.
Can be decoded independently.
Starts with an instantaneous decoding refresh (IDR) access unit, which contains an intra picture.
A NAL unit stream may contain one or more coded video sequences.
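VCL (Video Coding Layer)
[Figure: block diagram of the video encoder. The input video is split into 16×16 macroblocks. Blocks shown: Motion Estimation, Motion Compensation, Frame Memory, Intra-Prediction, Intra/Inter switch, DCT, Q, IQ, IDCT, Clipping, De-blocking Filter, and VLC producing the output bitstream. The embedded Decoder path produces the output video.]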
YCbCr Color Space and 4:2:0 Sampling
Pictures, Frames, and Fields
[Figure: a progressive frame versus an interlaced frame (top field first), whose top field and bottom field are separated in time by ∆t]
Slices and Slice Groups (1)
[Figure: subdivision of a picture into slices (Slice #0, Slice #1, Slice #2) when not using FMO (flexible macroblock ordering)]
Slices and Slice Groups (2)
[Figure: two subdivisions of a QCIF frame into slices utilizing FMO, one with Slice Groups #0, #1, and #2, the other with Slice Groups #0 and #1]
Slice coding types
I slice
P slice
B slice
SP slice (switching P slice): efficient switching between different pre-coded pictures becomes possible.
SI slice (switching I slice): allows an exact match of a macroblock in an SP slice for random access and error recovery purposes.
Adaptive Frame/Field Coding Operation
Three modes can be chosen adaptively for each frame in a sequence:
Frame mode
Field mode
Frame mode with field-coded macroblock pairs: for a frame that consists of mixed moving and non-moving regions, the frame/field encoding decision can be made for each vertical pair of macroblocks (a 16×32 luma region) in the frame. This is macroblock-adaptive frame/field (MBAFF) coding.
Picture-adaptive frame/field (PAFF) coding, which chooses between frame mode and field mode per picture, was reported to save 16% to 20% over frame-only coding for ITU-R 601 sequences such as “Canoa” and “Rugby”.
Macroblock-adaptive frame/field (MBAFF)
[Figure: a pair of macroblocks in frame mode versus top/bottom macroblocks in field mode]
PAFF vs MBAFF
The main idea of MBAFF is to preserve as much spatial consistency as possible.
In MBAFF, one field cannot use the macroblocks in the other field of the same frame as a reference for motion prediction.
PAFF coding can be more efficient than MBAFF coding in the case of rapid global motion, scene change, or intra picture refresh.
MBAFF was reported to reduce bit rates by 14% to 16% over PAFF for ITU-R 601 sequences (“Mobile and Calendar”, “MPEG-4 World News”).
Intra-Frame Prediction (1)
Intra_4×4: well suited for coding of parts of a picture with significant detail.
Intra_16×16, together with chroma prediction: more suited for coding very smooth areas of a picture. 4 prediction modes.
I_PCM: bypasses prediction and transform coding and sends the values of the encoded samples directly.
Intra-Frame Prediction (2)
In H.263+ and MPEG-4 Visual, intra prediction is conducted in the transform domain.
In H.264/AVC, intra prediction is always conducted in the spatial domain.
Inter-Frame Prediction in P slices (1): Segmentations of the macroblock
[Figure: macroblock (MB) types 16×16, 16×8, 8×16, and 8×8, and 8×8 sub-partition types 8×8, 8×4, 4×8, and 4×4, plus P_Skip. Source: www.vcodex.com, H.264 / MPEG-4 Part 10: Inter Prediction]
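Inter-Frame Prediction in P slices (2): The accuracy of motion compensation
[Figure: grid of integer sample positions (upper-case letters A to U) and fractional sample positions (lower-case letters a to s, and aa to hh) used for quarter-sample luma interpolation]
Half-sample positions (6-tap filter):
b1 = (E - 5F + 20G + 20H - 5I + J)
h1 = (A - 5C + 20G + 20M - 5R + T)
b = (b1 + 16) >> 5
h = (h1 + 16) >> 5
j1 = cc - 5dd + 20h1 + 20m1 - 5ee + ff
j = (j1 + 512) >> 10
Results are clipped to 0~255.
Quarter-sample positions (averaging):
a = (G + b + 1) >> 1
e = (b + h + 1) >> 1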
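The interpolation formulas above translate almost directly into integer code. The following is a minimal sketch that evaluates them for individual samples; the function names are hypothetical, and picture-boundary handling and the full two-dimensional sample layout are left out.

```python
def clip255(x: int) -> int:
    """Clip a value to the 8-bit sample range 0..255."""
    return max(0, min(255, x))


def six_tap(s0, s1, s2, s3, s4, s5):
    """Unrounded 6-tap filter output (the b1 / h1 / j1 intermediates)."""
    return s0 - 5 * s1 + 20 * s2 + 20 * s3 - 5 * s4 + s5


def half_sample(E, F, G, H, I, J):
    """b = (b1 + 16) >> 5, from six integer samples in a row or column."""
    return clip255((six_tap(E, F, G, H, I, J) + 16) >> 5)


def centre_half_sample(intermediates):
    """j = (j1 + 512) >> 10, from six unrounded intermediate values."""
    return clip255((six_tap(*intermediates) + 512) >> 10)


def quarter_sample(p, q):
    """a = (G + b + 1) >> 1: average of the two nearest samples."""
    return (p + q + 1) >> 1


b = half_sample(0, 0, 0, 255, 255, 255)   # half-way across a sharp edge
a = quarter_sample(0, b)                  # quarter position next to G = 0
print(b, a)
```

Keeping the intermediates unrounded for j is what makes the larger rounding offset (512) and shift (10) necessary: two filter passes each contribute a gain of 32.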
Inter-Frame Prediction in P slices (3): Multiframe motion-compensated prediction
[Figure: the current picture is predicted from 4 prior decoded pictures used as references; the examples shown use reference pictures ∆ = 1, ∆ = 2, and ∆ = 4]
Inter-Frame Prediction in B slices
Other pictures can reference pictures containing B slices.
Weighted average of two distinct motion-compensated prediction values.
Utilizes two distinct lists of reference pictures (list0, list1).
4 prediction types: list0, list1, bi-predictive, and direct prediction. A B_Skip mode is also available.
For each partition, the prediction type can be chosen separately.
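The "weighted average of two predictions" can be illustrated per sample as below. The default case is a rounded average; the weighted variant is a simplified sketch modeled on explicit weighted prediction, in which the parameter names (w0, w1, o0, o1, log_wd) and their default values are assumptions for illustration and not a restatement of the normative process.

```python
def clip255(x: int) -> int:
    return max(0, min(255, x))


def bipred_default(p0: int, p1: int) -> int:
    """Rounded average of the list0 and list1 prediction samples."""
    return (p0 + p1 + 1) >> 1


def bipred_weighted(p0, p1, w0=32, w1=32, o0=0, o1=0, log_wd=5):
    """Weighted and offset bi-prediction (simplified sketch).

    With w0 = w1 = 2**log_wd and zero offsets this reduces to the
    default rounded average.
    """
    weighted = (p0 * w0 + p1 * w1 + (1 << log_wd)) >> (log_wd + 1)
    offset = (o0 + o1 + 1) >> 1
    return clip255(weighted + offset)


print(bipred_default(100, 103))               # plain average
print(bipred_weighted(100, 103, w0=16, w1=16))  # fade to half brightness
```

Unequal or reduced weights let the encoder track a fade without spending bits on a large residual.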
Transform, Scaling, and Quantization (1)
4×4 DCT-like integer transform matrix
H =
[ 1  1  1  1 ]
[ 2  1 -1 -2 ]
[ 1 -1 -1  1 ]
[ 1 -2  2 -1 ]
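To make the role of the matrix concrete, the sketch below applies it as a separable two-dimensional transform, Y = H · X · Hᵀ, to a 4×4 residual block using only integer arithmetic. The helper names are hypothetical, and the scaling that normally accompanies the transform is assumed to be folded into quantization and is not shown.

```python
H = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul(a, b):
    """Plain integer matrix product of two 4x4 lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_transform(block):
    """Y = H * X * H^T (unscaled integer core transform)."""
    return matmul(matmul(H, block), transpose(H))


flat = [[10] * 4 for _ in range(4)]         # perfectly smooth block
ramp = [[0, 1, 2, 3] for _ in range(4)]     # horizontal gradient
print(forward_transform(flat)[0])           # energy only in the DC term
print(forward_transform(ramp)[0])
```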
Transform, Scaling, and Quantization (2)
Repeated transforms
Intra_16×16 and the chroma intra modes are intended for coding smooth areas.
The DC coefficients undergo a second transform, with the result that we have transform coefficients covering the whole macroblock.
[Figure: repeated transform for chroma blocks. The four chroma blocks are numbered 0, 1, 2, 3; the indices 00, 01, 10, 11 correspond to the indices of the 2×2 inverse Hadamard transform.]
Transform, Scaling, and Quantization (3)
The quantization parameter can take 52 values.
An increase of 1 in the quantization parameter means an increase of the quantization step size by approximately 12% (an increase of 6 means an increase of the quantization step size by exactly a factor of 2).
A change of step size by approximately 12% also means roughly a reduction of bit rate by approximately 12%.
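The relationship between quantization parameter and step size can be checked numerically. The sketch below uses the six base step sizes commonly tabulated for QP 0 to 5; those values are not on the slides and are stated here as an assumption, as is the function name.

```python
# Assumed base step sizes for QP = 0..5 (commonly tabulated values).
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]


def qstep(qp: int) -> float:
    """Quantization step size for QP in 0..51 (52 values)."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))


for qp in (22, 23, 28):
    print(qp, qstep(qp))
print("average growth per QP step:", 2 ** (1 / 6))  # about 1.12
```

The step size doubles exactly every 6 QP values, so the average growth per step is the sixth root of 2, which is where the "approximately 12%" figure comes from.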
Transform, Scaling, and Quantization (4)
Scanning order: zig-zag scan.
For the 2×2 DC coefficients of the chroma component: raster-scan order.
All inverse transform operations in H.264/AVC can be implemented using only additions and bit-shifting operations of 16-bit integer values.
Only 16-bit memory accesses are needed for a good implementation of the forward transform and quantization process in the encoder.
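A small sketch of the two scanning orders mentioned above, for a 4×4 coefficient block and a 2×2 chroma DC block. The index table is the usual 4×4 frame zig-zag pattern written out in raster indices (row * 4 + column); the names are hypothetical, and the alternative scan used for field coding is not covered.

```python
# Zig-zag order expressed as raster indices (row * 4 + column).
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def zigzag_scan(block):
    """Read a 4x4 block (list of rows) in zig-zag order."""
    flat = [value for row in block for value in row]
    return [flat[i] for i in ZIGZAG_4X4]


def raster_scan_2x2(block):
    """Read the 2x2 chroma DC block row by row."""
    return [block[0][0], block[0][1], block[1][0], block[1][1]]


coefficients = [
    [7, 6, 0, 0],
    [-2, -1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(zigzag_scan(coefficients))
```

Zig-zag ordering places the low-frequency coefficients first, so the nonzero values cluster at the start of the sequence and the zeros at the end.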
Entropy Coding
Two methods of entropy coding are supported.
An exp-Golomb code: a single infinite-extent codeword table for all syntax elements except the quantized transform coefficients.
For transmitting the quantized transform coefficients: Context-Adaptive Variable Length Coding (CAVLC).
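The "single infinite-extent codeword table" can be generated on the fly instead of being stored. The sketch below builds zero-order exp-Golomb codewords as bit strings for unsigned values, and maps signed values onto them in the usual way (positive k to 2k - 1, non-positive k to -2k). Function names are hypothetical, and bit strings are used instead of packed bits purely for readability.

```python
def ue(value: int) -> str:
    """Unsigned exp-Golomb codeword: N zeros, then (value + 1) in N + 1 bits."""
    if value < 0:
        raise ValueError("ue() expects a non-negative integer")
    info = bin(value + 1)[2:]
    return "0" * (len(info) - 1) + info


def se(value: int) -> str:
    """Signed exp-Golomb codeword via the mapping k>0 -> 2k-1, k<=0 -> -2k."""
    return ue(2 * value - 1 if value > 0 else -2 * value)


def decode_ue(bits: str) -> int:
    """Decode one unsigned codeword from the start of a bit string."""
    zeros = len(bits) - len(bits.lstrip("0"))
    return int(bits[zeros:2 * zeros + 1], 2) - 1


for v in range(5):
    print(v, ue(v))
```

Short codewords go to small values, which suits syntax elements whose small values are the most probable.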
CAVLC (1)
The number of nonzero quantized coefficients (N) and the actual size and position of the coefficients are coded separately.
Example coefficient sequence: 7, 6, -2, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
1) Number of Nonzero Coefficients (N) and “Trailing 1s” (T1s)
T1s = 2, N = 5
These two values are coded as a combined event. One out of 4 VLC tables is used based on the number of coefficients in neighboring blocks.
CAVLC (2)
7, 6, -2, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
2) Encoding the Value of Coefficients
For T1s, only the sign needs to be coded. Coefficient values are coded in reverse order: -2, 6, …
A starting VLC is used for -2, and a new VLC may be used based on the just-coded coefficient. In this way, adaptation is obtained in the use of VLC tables. Six exp-Golomb code tables are available for this adaptation.
CAVLC (3)
7, 6, -2, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
3) Sign Information
For T1s, this is sent as single bits. For the other coefficients, the sign bit is included in the exp-Golomb codes.
CAVLC (4)
4) TotalZeroes: the number of zeros between the last nonzero coefficient of the scan and its start. TotalZeroes = 3.
With N = 5, the number must be in the range 0-11. 15 tables are available for N in the range 1-15. (If N = 16 there is no zero coefficient.)
7, 6, -2, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
5) RunBefore: in this example it must be specified how the 3 zeros are distributed. The number of 0s before the last coefficient is coded first: 2 (range 0-3), using a suitable VLC. The next run is 1 (range 0-1).
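The five quantities walked through in the CAVLC slides can be derived from a scanned coefficient sequence as sketched below. The code only extracts the symbols (N, T1s, the remaining levels in reverse order, TotalZeroes, and the RunBefore values); it does not perform the table-driven VLC encoding, and the function name and the cap of three trailing ones are stated as assumptions.

```python
def cavlc_symbols(coeffs):
    """Derive the CAVLC symbols for one scanned coefficient sequence."""
    nonzero = [i for i, c in enumerate(coeffs) if c != 0]
    n = len(nonzero)

    # Trailing ones: +/-1 values at the end of the nonzero list
    # (assumed to be capped at three).
    t1s = 0
    for i in reversed(nonzero):
        if abs(coeffs[i]) == 1 and t1s < 3:
            t1s += 1
        else:
            break

    # Remaining levels, coded in reverse scan order.
    levels = [coeffs[i] for i in reversed(nonzero)][t1s:]

    # Zeros located before the last nonzero coefficient.
    total_zeros = (nonzero[-1] + 1 - n) if nonzero else 0

    # Zero runs before each coefficient, in reverse order, until no zeros remain.
    run_before, zeros_left = [], total_zeros
    for k in range(n - 1, 0, -1):
        if zeros_left == 0:
            break
        run = nonzero[k] - nonzero[k - 1] - 1
        run_before.append(run)
        zeros_left -= run

    return {"N": n, "T1s": t1s, "levels": levels,
            "TotalZeroes": total_zeros, "RunBefore": run_before}


example = [7, 6, -2, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(cavlc_symbols(example))
```

For the example sequence this reproduces the values quoted on the slides: N = 5, T1s = 2, levels starting with -2 and 6, TotalZeroes = 3, and runs of 2 and then 1.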
CAVLC vs CABAC
The efficiency of entropy coding can be improved further if Context-Adaptive Binary Arithmetic Coding (CABAC) is used.
Compared to CAVLC, CABAC typically provides a reduction in bit rate of between 5% and 15%.
The highest gains are typically obtained when coding interlaced TV signals.
In-Loop Deblocking filter
[Figure: samples p2, p1, p0 on one side and q0, q1, q2 on the other side of a 4×4 block edge]
When to apply the deblocking filter
For p0 and q0:
1. |p0 - q0| < α(QP)
2. |p1 - p0| < β(QP)
3. |q1 - q0| < β(QP)
For p1 and q1:
|p2 - p0| < β(QP) or |q2 - q0| < β(QP)
The filter typically reduces the bit rate by 5% to 10%.
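The edge-activity tests above can be written as simple predicates. In this sketch the thresholds alpha and beta are passed in as plain numbers, since the QP-dependent threshold tables are not reproduced on the slides; the function names are hypothetical, and it is assumed that p1 (or q1) is only considered once the three conditions for p0 and q0 already hold.

```python
def filter_p0_q0(p, q, alpha, beta):
    """p = (p0, p1, p2), q = (q0, q1, q2). All three conditions must hold."""
    return (abs(p[0] - q[0]) < alpha
            and abs(p[1] - p[0]) < beta
            and abs(q[1] - q[0]) < beta)


def filter_p1(p, q, alpha, beta):
    """p1 is filtered if the edge is filtered and |p2 - p0| < beta."""
    return filter_p0_q0(p, q, alpha, beta) and abs(p[2] - p[0]) < beta


def filter_q1(p, q, alpha, beta):
    """q1 is filtered if the edge is filtered and |q2 - q0| < beta."""
    return filter_p0_q0(p, q, alpha, beta) and abs(q[2] - q[0]) < beta


blocking = ((100, 101, 102), (104, 105, 106))   # small step: likely an artifact
real_edge = ((50, 50, 50), (200, 200, 200))     # large step: genuine image edge
print(filter_p0_q0(*blocking, alpha=10, beta=3))
print(filter_p0_q0(*real_edge, alpha=10, beta=3))
```

The thresholds keep the filter from smoothing genuine image edges: a large step across the boundary fails the α test and is left untouched.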
Hypothetical Reference Decoder
In H.264/AVC, the HRD specifies the operation of two buffers:
The coded picture buffer (CPB): models the arrival and removal time of the coded bits.
The decoded picture buffer (DPB).
The HRD design is similar in spirit to what MPEG-2 had, but is more flexible in supporting a variety of bit rates without excessive delay.
Profiles and Levels
Three profiles: Baseline, Main, and Extended.
Baseline supports all features in H.264/AVC except:
Set 1: B slices, weighted prediction, CABAC, field coding, and picture- or macroblock-adaptive switching between frame and field coding.
Set 2: SP/SI slices and slice data partitioning.
Conclusions
Some important differences relative to prior standards:
Enhanced motion-prediction capability
Use of a small block-size exact-match transform
Adaptive in-loop deblocking filter
Enhanced entropy coding methods
When used well together, these features yield approximately 50% bit rate savings for equivalent perceptual quality relative to the performance of prior standards.