HEVC Encoder r11 Final

Analysis and Parallelism of HEVC Encoder

Kwangwoon University (KWU) Donggyu Sim ([email protected])

Contents

• Overview of HEVC

• Encoding issues for HEVC test model (HM)

• Complexity analysis of HEVC encoder

• Fast encoding algorithms and performances

• Issues of parallel processing

• Conclusion

OVERVIEW OF HEVC

Introduction of HEVC

• High efficiency video coding (HEVC) – JCT-VC
  • Joint Collaborative Team on Video Coding
  • ITU-T SG16 Q.6 (VCEG) and ISO/IEC JTC1/SC29/WG11 (MPEG)
– Requirement
  • Development of a standard for video coding technology more advanced than the current AVC standard
  • Development of the associated conformance testing and reference software specifications
  • Informal goal: 50% better coding efficiency than H.264/AVC (about half the bit rate at comparable quality)

Timeline

• Delayed HEVC timeline
– Test model selection process begins 2010/04 (1st JCTVC meeting at Dresden, DE)
– Test model selection by 2010/10 (3rd JCTVC meeting at Guangzhou, CN)
– Committee Draft (CD) approval by 2012/02 (8th JCTVC meeting at San Jose, US)
– Final Committee Draft (FCD)
– Draft International Standard (DIS) approval by 2012/07 (10th JCTVC meeting at Stockholm, SE)
– Final Draft International Standard (FDIS) approval by 2013/01 (12th JCTVC meeting at Geneva, CH)

Block diagram of HEVC standard

• Typical block-based hybrid codec structure + enhanced and additional tools

FIGURE. Block diagram of the HEVC codec (picture in, bitstream out). Tools per block:
– Block structure: CU 8x8 to 64x64, diverse PU types
– Intra prediction: angular intra prediction
– Inter prediction: AMVP, Merge, DCT-IF
– Transform: TU 4x4 to 32x32, residual quad-tree transform
– Quantization: delta QP, RDO-Q
– Entropy coding: CABAC
– Inv. Quantization, Inv. Transform
– In-loop filter: deblocking filter, SAO
– Picture storage: 8-bit/sample

Block structure in HEVC

• Three block structures are defined in HEVC
– Coding unit (CU)
– Prediction unit (PU)
– Transform unit (TU)

FIGURE. A 64×64 CTU split by quad-tree into 32×32, 16×16, and 8×8 CUs; PU partition types 2N×2N, N×2N, 2N×N, N×N, 2N×nU, 2N×nD, nL×2N, nR×2N; TU depth 0, 1, 2, …

Profiles for HEVC

• HEVC standard supports 3 profiles (Main, HE10, Main Still Picture)

TABLE. Coding tools per profile: Main, High efficiency 10 (HE10), Main Still Picture. Items marked (–) carry a dash in the Main Still Picture column.

– High-level structure
  • High-level support for frame rate temporal nesting and random access; clean random access (CRA) support (–)
  • Rectangular tile-structured scanning
  • Wavefront-structured processing dependencies for parallelism
  • Slices with spatial granularity equal to coding tree unit
– Coding units, prediction units, and transform units
  • Coding unit quadtree structure (square coding unit block sizes 2Nx2N, for N=4, 8, 16, 32; i.e., up to 64x64 luma samples in size)
  • Prediction units, Main and HE10 (for coding unit size 2Nx2N: for Inter, 2Nx2N, 2NxN, Nx2N, and, for N>4, also 2Nx(N/2+3N/2) & (N/2+3N/2)x2N; for Intra, only 2Nx2N and, for N=4, also NxN)
  • Prediction units, Main Still Picture (2Nx2N and, for N=4, also NxN)
  • Transform unit tree structure within coding unit (maximum of 3 levels)
  • Transform block size of 4x4 to 32x32 samples (always square)
– Spatial signal transformation and PCM representation
  • DCT-like integer block transform; for Intra also a DST-based integer block transform (only for Luma 4x4)
  • Transforms can cross prediction unit boundaries for Inter; not for Intra (–)
  • PCM coding with worst-case bit usage limit
– Intra-picture prediction
  • Angular intra prediction (33 directions; 35 modes in total including DC and planar)
  • Planar intra prediction
– Inter-picture prediction
  • Luma motion compensation interpolation: 1/4 sample precision, 8x8 separable with 6 bit tap values (–)
  • Chroma motion compensation interpolation: 1/8 sample precision, 4x4 separable with 6 bit tap values (–)
  • Advanced motion vector prediction with motion vector “competition” and “merging” (–)
– Entropy coding
  • Context adaptive binary arithmetic entropy coding (RDOQ on)
– Picture storage and output precision
  • Main: 8 bit-per-sample storage and output
  • HE10: 10 bit-per-sample storage and output
  • Main Still Picture: 8 bit-per-sample storage and output
– In-loop filtering
  • Deblocking filter
  • Sample-adaptive offset filter

Level

TABLE. Level limits
– MaxLumaPS: max luma picture size (samples)
– MaxCPB: max CPB size (1000 bits), Main tier / High tier
– MaxSliceSegmentsPerPicture: max slice segments per picture
– MaxTileRows / MaxTileCols: max # of tile rows / tile columns
– MaxLumaSR: max luma sample rate (samples/sec)
– MaxBR: max bit rate (1000 bits/s), Main tier / High tier
– MinCR: min compression ratio

Level | MaxLumaPS | MaxCPB Main | MaxCPB High | MaxSliceSegmentsPerPicture | MaxTileRows | MaxTileCols | MaxLumaSR | MaxBR Main | MaxBR High | MinCR
1 | 36 864 | 350 | - | 16 | 1 | 1 | 552 960 | 128 | - | 2
2 | 122 880 | 1 500 | - | 16 | 1 | 1 | 3 686 400 | 1 500 | - | 2
2.1 | 245 760 | 3 000 | - | 20 | 1 | 1 | 7 372 800 | 3 000 | - | 2
3 | 552 960 | 6 000 | - | 30 | 2 | 2 | 16 588 800 | 6 000 | - | 2
3.1 | 983 040 | 10 000 | - | 40 | 3 | 3 | 33 177 600 | 10 000 | - | 2
4 | 2 228 224 | 12 000 | 30 000 | 75 | 5 | 5 | 66 846 720 | 12 000 | 30 000 | 4
4.1 | 2 228 224 | 20 000 | 50 000 | 75 | 5 | 5 | 133 693 440 | 20 000 | 50 000 | 4
5 | 8 912 896 | 25 000 | 100 000 | 200 | 11 | 10 | 267 386 880 | 25 000 | 100 000 | 6
5.1 | 8 912 896 | 40 000 | 160 000 | 200 | 11 | 10 | 534 773 760 | 40 000 | 160 000 | 8
5.2 | 8 912 896 | 60 000 | 240 000 | 200 | 11 | 10 | 1 069 547 520 | 60 000 | 240 000 | 8
6 | 35 651 584 | 60 000 | 240 000 | 600 | 22 | 20 | 1 069 547 520 | 60 000 | 240 000 | 8
6.1 | 35 651 584 | 120 000 | 480 000 | 600 | 22 | 20 | 2 139 095 040 | 120 000 | 480 000 | 8
6.2 | 35 651 584 | 240 000 | 800 000 | 600 | 22 | 20 | 4 278 190 080 | 240 000 | 800 000 | 6

HM ENCODER

Decision‐level for HEVC encoder

• Sequence-level
– Coding structure (All intra, Low delay, Random access)
– Profile, tier, level
– Max/Min CTU size, CU depth
– Max/Min TU size, TU depth
– Tool on/off (SAO, deblocking, WPP, tile)
• Picture-level
– # of reference frames, rate control
– Tile, slice
• Slice- or tile-level
– Reference frames
– Deblocking filter parameters
• CTU-level
– CU partitioning
– Sample adaptive offset parameters
• CU-level
– PU and TU partitioning
• PU- & TU-level
– Prediction modes, motion vectors
– cbf, coefficients

FIGURE. Decision hierarchy: Sequence, Picture, Slice or Tile, CTU, CU, PU & TU

Coding structure (1/3)

• All intra (AI)
– Every picture is coded as an instantaneous decoding refresh (IDR) picture
– No temporal prediction is allowed
– No QP variation is allowed

FIGURE. All-intra coding structure: IDR pictures 0 to 7 along the time axis, each coded with QPI; coding order = POC


• Low delay (LD)
– The first picture shall be coded as an IDR picture
– Generalized P and B (GPB) pictures shall be used for all successive pictures
  • A GPB picture may use only reference pictures whose POC is smaller than that of the current picture (all reference pictures in List_0 and List_1 shall be temporally previous in display order relative to the current picture)
– The QP of each inter coded picture shall be derived by adding an offset to the QP of the intra coded picture, depending on the temporal layer

FIGURE. Low-delay coding structure: an IDR or intra picture (QPI) followed by GPB pictures 1 to 8 along the time axis; QPBL1 = QPI+1, QPBL2 = QPI+2, QPBL3 = QPI+3 over temporal depths 0 to 2; coding order = POC


• Random access (RA)
– A hierarchical B structure shall be used for coding
– An IDR intra picture or a clean random access (CRA) picture shall be inserted cyclically, about once per second, as a random access point
– The QP of each inter coded picture shall be derived by adding an offset to the QP of the intra coded picture, depending on the temporal layer

FIGURE. Random-access coding structure: an IDR or intra picture (QPI), GPB picture, referenced B pictures, and non-referenced B pictures in a hierarchy over temporal depths 0 to 3; QPBL1 = QPI+1, QPBL2 = QPI+2, QPBL3 = QPI+3, QPBL4 = QPI+4; coding order ≠ POC
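The QP rule for the hierarchical structures can be sketched as follows. This is a minimal illustration, assuming a GOP of 8 pictures and assuming that QPBL1 to QPBL4 map to temporal depths 0 to 3 as read from the figure legend; the function names `temporal_depth` and `picture_qp` are hypothetical and not taken from HM.

```python
def temporal_depth(poc, gop_size=8):
    """Temporal depth of a picture inside a hierarchical-B GOP (assumed size 8).

    POC 8k -> depth 0, 8k+4 -> 1, 8k+2 / 8k+6 -> 2, odd POCs -> 3.
    """
    pos = poc % gop_size
    if pos == 0:
        return 0
    depth, step = 1, gop_size // 2
    while pos % step != 0:
        step //= 2
        depth += 1
    return depth


def picture_qp(qp_i, poc, is_intra=False, gop_size=8):
    """QP of a picture: QPI for intra pictures, QPI + depth + 1 otherwise.

    The '+1' reflects QPBL1 = QPI+1 at depth 0 (an assumption read from the figure).
    """
    if is_intra:
        return qp_i
    return qp_i + temporal_depth(poc, gop_size) + 1


if __name__ == "__main__":
    for poc in range(9):
        print(poc, temporal_depth(poc), picture_qp(32, poc, is_intra=(poc == 0)))
```

With QPI = 32, the picture at POC 4 (depth 1) would be coded with QP 34 under these assumptions.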

Picture of HEVC

• Picture : A picture contains an array of luma samples in monochrome format, or an array of luma samples and two corresponding arrays of chroma samples in 4:2:0, 4:2:2, or 4:4:4 color format.
– The coding order of coding tree units (CTUs) is raster scan order

Example) Class A (2560×1600) – NebutaFestival
– CTU size : 64×64
– 40×25 CTU partition

* CTU & CTB : The CTU consists of a luma coding tree block (CTB), the corresponding chroma CTBs, and syntax elements
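The 40×25 figure in the example follows from dividing the picture size by the CTU size and rounding up. A minimal sketch (the helper name `ctu_grid` is hypothetical):

```python
def ctu_grid(width, height, ctu_size=64):
    """Number of CTU columns and rows; partial CTUs at the picture edge count as one."""
    cols = (width + ctu_size - 1) // ctu_size
    rows = (height + ctu_size - 1) // ctu_size
    return cols, rows


if __name__ == "__main__":
    print(ctu_grid(2560, 1600))  # Class A example from the slide
    print(ctu_grid(1920, 1080))  # the height is not a multiple of 64
```

For 1920×1080 the last CTU row is only partially covered by the picture, which is why the division rounds up.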

Picture partitioning

• A slice is a sequence of coding tree units (CTUs)

• Unlike slices, tiles are always rectangular and always contain an integer number of coding tree units in coding tree unit raster scan

• One or both of the following conditions shall hold for each slice and tile in a picture
– All CTBs in a slice belong to the same tile
– All CTBs in a tile belong to the same slice

FIGURE. A picture with 40×25 coding tree units that is partitioned into two slices

FIGURE. A picture with 40×25 coding tree units that is partitioned into three tiles

Overview of the HM encoding process

FIGURE. HM encoding process: Sequence, Picture, CTU decisions in a slice or a tile (CU partitioning decision, PU & TU partitioning decision, RDO process), then deblocking filter and SAO

Maximum CTU size & CU depth

• Size of CTU is specified in sequence parameter set (SPS)

TABLE. Syntax for size of CTU in SPS

seq_parameter_set_rbsp() {                         Descriptor
  …
  log2_min_coding_block_size_minus3                ue(v)
  log2_diff_max_min_coding_block_size_minus2       ue(v)
  …
}

CTU size: 16x16 ~ 64x64

FIGURE. Example of CU quad-tree structure (a 64×64 CTU split into 32×32, 16×16, and 8×8 CUs), signalled as:

64 CU : split_coding_unit_flag(1)
  32 CU : split_coding_unit_flag(0)
  32 CU : split_coding_unit_flag(1)
    16 CU : split_coding_unit_flag(0)
    16 CU : split_coding_unit_flag(0)
    16 CU : split_coding_unit_flag(1)
    …

Coding unit (CU) decision

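• Coding unit quad-tree structure
– Starting from the CTU, each CU can be split into 4 smaller CUs
– The best CU RD-cost is calculated for each CU level
– The best CU then competes with its sub-partitioned CUs

FIGURE. CU decision order for a 64×64 CTU split down to 8×8. Each CU carries two numbers: the order of its RD-cost calculation (depth-first: 64×64 is 1, the first 32×32 is 2, the first 16×16 is 3, its 8×8 CUs are 4 to 7, the next 16×16 is 8, …; the four 32×32 CUs are 2, 23, 44, 65) and the order of its competition against its sub-partitioned CUs (the first four 16×16 CUs are 1 to 4, the 32×32 CUs are 5, 10, 15, 20, and the 64×64 CU is 21; 8×8 CUs have none).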
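The competition between a CU and its four sub-partitioned CUs can be sketched as a depth-first recursion. This is a simplified illustration, not HM code: `cost_fn` is a hypothetical callback standing in for the best RD-cost of coding one CU without splitting, and the cost of signalling the split flag is ignored.

```python
def best_partition(x, y, size, min_size, cost_fn):
    """Return (best RD-cost, list of (x, y, size) leaf CUs) for the block at (x, y)."""
    cost_here = cost_fn(x, y, size)      # best RD-cost of the unsplit CU
    if size <= min_size:                 # smallest CU: nothing to compete against
        return cost_here, [(x, y, size)]

    half = size // 2
    split_cost, split_leaves = 0, []
    for dy in (0, half):                 # visit the four sub-CUs in z-order
        for dx in (0, half):
            cost, leaves = best_partition(x + dx, y + dy, half, min_size, cost_fn)
            split_cost += cost
            split_leaves += leaves

    # Competition of the CU against its sub-partitioned CUs
    if split_cost < cost_here:
        return split_cost, split_leaves
    return cost_here, [(x, y, size)]


if __name__ == "__main__":
    # Toy cost model (an assumption): only the top-left 32x32 area is "busy".
    def toy_cost(x, y, size):
        busy = x < 32 and y < 32
        return size ** 3 if busy else size
    cost, leaves = best_partition(0, 0, 64, 8, toy_cost)
    print(cost, len(leaves))
```

The recursion visits CUs in the same depth-first order as the calculation numbers in the figure, and the comparison at the end of each call corresponds to the competition numbers.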

Prediction unit (PU) types

• 2 PU types for Intra prediction
– 2N×2N (smallest CU: additionally N×N)
• 8 PU types for Inter prediction
– Smallest CU
  • 8x8: 2N×2N, N×2N, 2N×N
  • Others: 2N×2N, N×2N, 2N×N, N×N
– Other CUs: 2N×2N, N×2N, 2N×N, nL×2N, nR×2N, 2N×nU, 2N×nD

FIGURE. PU partitions in HEVC: 2N×2N, N×2N, 2N×N, N×N, 2N×nD, 2N×nU, nR×2N, nL×2N

FIGURE. PU type selection from the current CU size, the SCU size, and the AMP enable flag:
– Cur. CU size ≠ SCU size, AMP enabled: Intra 2N×2N, Inter 2N×2N, Inter 2N×N, Inter N×2N, Inter AMP
– Cur. CU size ≠ SCU size, AMP disabled: Intra 2N×2N, Inter 2N×2N, Inter 2N×N, Inter N×2N
– Cur. CU size == SCU size == 8×8: Intra 2N×2N, Inter 2N×2N, Inter 2N×N, Inter N×2N, Intra N×N
– Cur. CU size == SCU size ≠ 8×8: Intra 2N×2N, Inter 2N×2N, Inter 2N×N, Inter N×2N, Intra N×N, Inter N×N
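The selection of candidate PU types can be written as a small function. This sketch follows the rules listed on the slide; the function name `pu_candidates` and the string labels are hypothetical, and the skip/merge and PCM candidates are left out.

```python
AMP_MODES = ["Inter nLx2N", "Inter nRx2N", "Inter 2NxnU", "Inter 2NxnD"]


def pu_candidates(cu_size, scu_size, amp_enabled):
    """Candidate PU types tested for a CU of `cu_size`, given the smallest CU size."""
    modes = ["Intra 2Nx2N", "Inter 2Nx2N", "Inter 2NxN", "Inter Nx2N"]
    if cu_size == scu_size:
        modes.append("Intra NxN")        # NxN only at the smallest CU
        if cu_size != 8:
            modes.append("Inter NxN")    # inter NxN is not used for 8x8 CUs
    elif amp_enabled:
        modes += AMP_MODES               # asymmetric partitions for larger CUs
    return modes


if __name__ == "__main__":
    print(pu_candidates(8, 8, True))
    print(pu_candidates(16, 16, True))
    print(pu_candidates(32, 8, True))
```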

Maximum TU size & TU depth

• The root of a TU quad-tree is the CU to which the TUs belong
• Available transform block sizes and the max transform hierarchy depth are specified in SPS
– Transform block size: 4x4 ~ 32x32
– Maximum 3 levels (depth 0, 1, or 2)

FIGURE. TU quad-tree structure in HEVC (32×32 root)

TABLE. Syntax for size of TU in SPS

seq_parameter_set_rbsp() {                         Descriptor
  …
  log2_min_transform_block_size_minus_2            ue(v)
  log2_diff_max_min_transform_block_size           ue(v)
  …
  max_transform_hierarchy_depth_inter              ue(v)
  max_transform_hierarchy_depth_intra              ue(v)
  …
}

RDO process to decide PU & TU

FIGURE. Compress CU flow: for the current CU, evaluate 2N×2N merge skip, Inter 2N×2N, Inter N×2N, Inter 2N×N, Inter AMP, Intra 2N×2N, Intra N×N, and Intra PCM. If CU size ≤ SCU (Yes), finish; otherwise (No), run Compress CU on each of the four sub-CUs.

Intra prediction flow

• Prediction modes
– Luma
  • DC, Planar, Angular prediction (33 directions)
– Chroma
  • DC, planar, ver, hor, DM
• Filtering
– MDIS (Mode dependent intra smoothing)
– DC filtering, Ver/Hor filtering
• 3 MPM (most probable modes)

FIGURE. Flow chart - Intra prediction: 2N×2N PU, reference sample padding, MDIS, intra prediction, RD-cost and Intra_mode; repeated while Mode < 35 (Y), then best mode decision (N)

Cost functions used in the flow:
– J_mode = SSE + λ_mode * B_mode
– J_pred,SATD = SATD + λ_pred * B_pred

FIGURE. Directions and modes of HEVC intra prediction


Fast intra prediction & TU decision in HM

• Intra prediction steps in HM
1) Rough prediction mode decision
  • Test all 35 prediction modes
  • Select N prediction modes by the cost: Distortion (SATD) + λ * mode bits
  • # of candidate prediction modes : N modes + MPM (3)
2) Best intra prediction mode decision with transform
  • Transform (RQT depth = 1)
  • Decide the 1 best intra mode
3) Best RQT decision with RD costs
  • RQT depth = 3

FIGURE. Candidate reduction: 35 modes, then N modes + MPM, then 1 best mode, then the best mode RD-cost
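Step 1 (rough mode decision) can be sketched as below. This is an illustration under assumptions: the per-mode SATD values and mode bits are supplied by the caller, and the name `rough_mode_decision` is hypothetical. How N depends on the block size is not stated on the slide, so it is left as a parameter.

```python
def rough_mode_decision(satd, mode_bits, lam, n, mpm):
    """Pick the N cheapest modes by SATD + lambda * bits, then append missing MPMs.

    satd, mode_bits: sequences indexed by intra mode (0..34)
    mpm: the 3 most probable modes
    """
    cost = [d + lam * b for d, b in zip(satd, mode_bits)]
    ranked = sorted(range(len(cost)), key=lambda m: cost[m])  # stable: ties keep mode order
    candidates = ranked[:n]
    for m in mpm:
        if m not in candidates:
            candidates.append(m)
    return candidates


if __name__ == "__main__":
    # Toy data (an assumption): the SATD grows with the distance from mode 10.
    satd = [abs(m - 10) * 10 for m in range(35)]
    bits = [0] * 35
    print(rough_mode_decision(satd, bits, lam=1.0, n=3, mpm=[0, 1, 26]))
```

Only the returned candidates go on to step 2, where each one is evaluated with an actual transform and quantization.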


Inter prediction

• Skip: Merge skip
• Non-skip
– Uni-directional prediction
– Bi-directional prediction
– Half-pel/Quarter-pel motion refinement
  • DCT-IF (8tap/4tap)
– Merge

FIGURE. Flow chart - Inter prediction. For the current CU: merge skip, then Inter 2N×2N, Inter 2N×N, Inter N×2N, and AMP (nL×2N, nR×2N, 2N×nU, 2N×nD). Each partition evaluates uni-directional prediction, bi-directional prediction, and merge (spatial, temporal, and additional candidates derivation), followed by RD-cost calculation and best mode decision. Output: RD-cost, best mode.

Inter prediction

• Inter coding mode
– Merge skip mode (CU level)
  • skip_flag=1 and merge_idx
  • No reference index
  • No motion vector
  • No residual
– Merge mode (PU level)
  • skip_flag=0, pred_mode_flag, and part_mode
  • merge_flag=1 and merge_idx
  • No reference index and motion vector
  • no_residual_syntax_flag: whether the residual is encoded or not
– General PU modes
  • skip_flag=0, pred_mode_flag, and part_mode
  • merge_flag=0
  • ref_idx_lx and mvp_lx_flag based on AMVP (x=0 or 1)
  • MVD is encoded
  • no_residual_syntax_flag: whether the residual is encoded or not

Inter prediction flow

BEGIN
  input : current PU part mode for a CU
  FOR PU partition
    FOR List = 0 to 1 DO
      FOR 0 to refidx DO
        Motion estimation (diamond search, SR : 64)
        Decide best RD-cost for uni-prediction
      ENDFOR
    ENDFOR
    IF bi-directional prediction THEN
      FOR iteration = 0 to 3 DO
        FOR 0 to refidx DO
          Motion estimation (full search, SR : 4)
          Decide best RD-cost for bi-prediction
        ENDFOR
      ENDFOR
    ENDIF
  ENDFOR
  Merge
  RD-cost competition among uni/bi-prediction and merge
END
output : inter prediction syntax

FIGURE. Pseudo code - Inter prediction flow

Fast encoder decision (FEN)
– Sub-sampled SAD for integer ME
  • Use sub-sampled SAD when rows > 8 for integer ME
– Only 1 iteration for bi-predictive motion search
  • Default number : 4

Fast Decision for Merge RD cost (FDM)
– After merge with merge idx X, if all cbf are zero, the merge process is terminated

FIGURE. Uni-prediction and bi-prediction of the current PU from LIST_0 and LIST_1 reference pictures along the time axis

Bi-prediction

• Search P0 and P1 which produce the minimum error with respect to the original block O
– R = O – P, where P = (P0 + P1) / 2
• Practical bi-predictive search
1) Search P1 which produces the minimum 2R against (2O – P0)
   R = O – (P0 + P1) / 2 ⇒ 2R = (2O – P0) – P1
2) Search P0 which produces the minimum 2R against (2O – P1)
   R = O – (P0 + P1) / 2 ⇒ 2R = (2O – P1) – P0
• BipredSearchRange : 4
• FEN : 1 (iteration : 1)

FIGURE. Example of bi-prediction with the List 0 reference, the List 1 reference, and the current frame. Uni-directional prediction (search range 64) gives P0 and P1. Bi-directional prediction (BipredSearchRange 4) then refines one predictor per iteration against the target 2O minus the other predictor: iteration 1 (P0, P1), iteration 2 (P0, P1'), iteration 3 (P0', P1'), iteration 4 (P0', P1'').
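The practical bi-predictive search can be sketched on toy data. This is a simplified illustration, not HM code: the candidate predictors are given as explicit lists instead of being fetched by a motion search, the cost is a plain SAD without motion-vector bits, and the sketch assumes that the iterations alternate between the lists starting with List 1. All names are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized sample lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def refine_bipred(orig, cands0, cands1, i0, i1, iterations=4):
    """Alternately refine the List 1 and List 0 predictor indices.

    orig: original block O; cands0/cands1: candidate predictors per list;
    i0/i1: indices of the best uni-directional predictors (the starting point).
    """
    for it in range(iterations):
        if it % 2 == 0:
            # Fix P0 and search P1 against the target 2*O - P0
            target = [2 * o - p for o, p in zip(orig, cands0[i0])]
            i1 = min(range(len(cands1)), key=lambda j: sad(target, cands1[j]))
        else:
            # Fix P1 and search P0 against the target 2*O - P1
            target = [2 * o - p for o, p in zip(orig, cands1[i1])]
            i0 = min(range(len(cands0)), key=lambda j: sad(target, cands0[j]))
    return i0, i1


if __name__ == "__main__":
    orig = [10, 10]
    cands0 = [[8, 8], [10, 10]]
    cands1 = [[12, 12], [10, 10], [0, 0]]
    print(refine_bipred(orig, cands0, cands1, i0=0, i1=1))
```

Searching against 2O – P0 rather than O lets each iteration reuse an ordinary block-matching search with a modified target block.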

Motion estimation (Integer‐pel)

• Practical motion estimation (diamond search)
– First search & early termination
  • Max 3 (default) more rounds after a recent best match
– Raster refinement search
  • If the integer-pel distance is bigger than 5, conduct the raster refinement search
– Star refinement search & early termination
  • Diamond search centered on the best match from the earlier two steps
  • Max 2 rounds after the best match

FIGURE. Raster refinement search

FIGURE. First search & star refinement (diamond pattern; search points labelled by round 0, 1, 2, 3, … outward from the center)

Motion estimation (Sub‐pel refinement)

• Integer-pel motion search
– Cost function : SAD
• Sub-pel motion refinement
– Cost function : SATD
– Half-pel refinement
– Quarter-pel refinement

FIGURE. Integer-pel motion search (within the search range)

FIGURE. Half-pel motion search

FIGURE. Quarter-pel motion search

Interpolation

• DCT-IF in HEVC
– Fixed 8-tap (7-tap) and 4-tap interpolation filters based on DCT
– 2D separable filter
  • 8 × horizontal 1-D filter + 1 × vertical 1-D filter

TABLE. Interpolation filter coefficients

Component | α   | Filter(α)
Luma      | 1/4 | {-1, 4, -10, 58, 17, -5, 1, 0}
Luma      | 1/2 | {-1, 4, -11, 40, 40, -11, 4, -1}
Chroma    | 1/8 | {-2, 58, 10, -2}
Chroma    | 1/4 | {-4, 54, 16, -2}
Chroma    | 3/8 | {-6, 46, 28, -4}
Chroma    | 1/2 | {-4, 36, 36, -4}

FIGURE. Integer and fractional sample positions for luma and chroma interpolation (luma: integer samples A i,j with quarter-sample positions a to r; chroma: integer samples B i,j with eighth-sample positions ab to hh)
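A 1-D application of the luma filters can be sketched as follows. This is a simplified single-stage illustration for 8-bit samples: the taps sum to 64, so the result is normalized with a rounding shift by 6. The standard's two-stage 2-D process with intermediate precision is not modelled, and the function name is hypothetical.

```python
# Luma DCT-IF taps from the table, indexed by the fractional position in quarter samples
LUMA_TAPS = {
    1: [-1, 4, -10, 58, 17, -5, 1, 0],    # 1/4
    2: [-1, 4, -11, 40, 40, -11, 4, -1],  # 1/2
}


def interpolate_luma_1d(samples, pos, frac):
    """Fractional sample between samples[pos] and samples[pos + 1].

    The 8 taps cover samples[pos - 3] .. samples[pos + 4], so both must exist.
    """
    taps = LUMA_TAPS[frac]
    acc = sum(c * samples[pos - 3 + k] for k, c in enumerate(taps))
    value = (acc + 32) >> 6               # taps sum to 64: round and normalize
    return max(0, min(255, value))        # clip to the 8-bit range


if __name__ == "__main__":
    ramp = [0, 10, 20, 30, 40, 50, 60, 70]
    print(interpolate_luma_1d(ramp, 3, 2))   # half-sample between 30 and 40
    print(interpolate_luma_1d(ramp, 3, 1))   # quarter-sample between 30 and 40
```

On a flat signal the output equals the input, which is a quick sanity check that a set of taps sums to 64.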

Example of PU decision

FIGURE. Compress CU flow (same flow as in the RDO process to decide PU & TU) with an example for one inter PU mode:
– Without TU decision and without reconstruction (RD-cost = SAD/SATD + λ*Bmode): uni-prediction = 12000 vs. bi-prediction = 9000 vs. merge = 11000
– With TU decision and reconstruction (RD-cost = SSE + λ*Bmode): bi-prediction = 8500

TU decision flow (Inter)

• Residual quad-tree

FIGURE. TU decision flow (Inter): for each PU partition (2N×2N, N×2N, 2N×N, N×N, 2N×nD, 2N×nU, nR×2N, nL×2N), residual = original – predictor; for TU depth 0, 1, 2, …, apply T/Q and IT/IQ (reconstruction) and compute the RD-cost (SSE + λ*Bmode)

TU decision flow (Intra)

• Example) intra_pred_mode = 10 (vertical mode)

FIGURE. TU decision flow (Intra): at TU depth N, intra prediction using the reference samples along the prediction direction, then residual, T/Q, IT/IQ, and RD-cost (SSE + λ*Bmode); at TU depth N+1, each sub-block uses reference samples that become available after the block above is reconstructed

Transform

• Implementation of transform in HEVC – Matrix multiplication

• Straightforward/Few code lines • Huge number of operations, but SIMD friendly

– Partial butterfly implementation • Utilizes symmetry/anti‐symmetry properties of basis vectors • Less multiplications/additions • Increase number of code lines

Matrix multiplication
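The two implementation styles can be compared on the 4-point transform. The sketch below assumes the 4-point HEVC core transform matrix (values not shown on the slide) and checks that the partial butterfly gives the same result as the direct matrix multiplication with fewer multiplications; scaling and rounding shifts are omitted.

```python
# 4-point core transform matrix (assumed here; not listed on the slide)
DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def dct4_matrix(x):
    """Direct matrix multiplication: 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in DCT4]


def dct4_butterfly(x):
    """Partial butterfly: uses the symmetry of even rows and anti-symmetry of odd rows."""
    e0, e1 = x[0] + x[3], x[1] + x[2]   # even part
    o0, o1 = x[0] - x[3], x[1] - x[2]   # odd part
    return [
        64 * (e0 + e1),
        83 * o0 + 36 * o1,
        64 * (e0 - e1),
        36 * o0 - 83 * o1,
    ]


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    print(dct4_matrix(x))
    print(dct4_butterfly(x))
```

The butterfly needs 6 multiplications instead of 16 for the 4-point case; the saving grows with the transform size, at the price of more code.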




Partitioning syntax for a CTU

Syntax

FIGURE. Example CTU partitioning (the CU quad-tree example with 32×32, 16×16, and 8×8 CUs)

CU split flags

64 CU : split_coding_unit_flag(1)
  32 CU : split_coding_unit_flag(0)
  32 CU : split_coding_unit_flag(1)
    16 CU : split_coding_unit_flag(0)
    16 CU : split_coding_unit_flag(0)
    16 CU : split_coding_unit_flag(1)
    …

PU partition & Pred_mode info

- SKIP flag (merge idx)
- Prediction mode flag (intra or inter)
- PU part size (2Nx2N, 2NxN, Nx2N, NxN, AMP)
- Prediction info. (Intra mode or mv and ref. idx., merge idx, AMVP idx)

TU split flags & Coefficients

32x32 TU : split flag (1)
  16x16 TU : split flag (0) <16x16 cbf, coefficients>
  16x16 TU : split flag (1)
    8x8 TU : split flag(0) <8x8 cbf, coefficients>
    8x8 TU : split flag(0) <8x8 cbf, coefficients>
    8x8 TU : split flag(0) <8x8 cbf, coefficients>
    8x8 TU : split flag(0) <8x8 cbf, coefficients>
  16x16 TU : split flag (1)
    8x8 TU : split flag(0) <8x8 cbf, coefficients>
    8x8 TU : split flag (1)
      4x4 TU : split flag(0) <4x4 cbf, coefficients>
      4x4 TU : split flag(0) <4x4 cbf, coefficients>
      4x4 TU : split flag(0) <4x4 cbf, coefficients>
      4x4 TU : split flag(0) <4x4 cbf, coefficients>
    …

FIGURE. Example of TU quad-tree structure

In‐loop filter

• In HEVC, two processing steps are applied: a deblocking filter (DBF) and a sample adaptive offset (SAO) operation
– DBF: similar to the DBF of the H.264/AVC standard
– SAO: applied adaptively to all samples satisfying certain conditions (while the DBF is only applied to the samples located at block boundaries)
• On/off syntax elements for the in-loop filters
1. slice_disable_deblocking_filter_flag : slice-level on/off
2. sample_adaptive_offset_enabled_flag : slice-level on/off

Deblocking filter (DBF)

• Basically, the deblocking filter of HEVC is similar to that of H.264/AVC
– In-loop filtering
  • Improves coding performance for inter frames
  • Frame-based filtering
  • On/off control is provided
– Adaptive filtering
  • Boundary strength
– Filtering on the block boundaries
  • Transform and prediction boundaries
– Sequential filtering for vertical and horizontal edges
  • Sample values modified during filtering of vertical edges are used as input for the filtering of the horizontal edges

Deblocking filter (DBF)

• Features of the HEVC deblocking filter compared to H.264/AVC
– For TUs and PUs with edges less than 8 samples in either the vertical or horizontal direction, only the edges lying on the 8×8 sample grid are filtered

FIGURE. Filtering order for a 16x16 coding unit in (a) H.264/AVC and (b) HEVC: 1. vertical edges -> horizontal filtering, 2. horizontal edges -> vertical filtering

FIGURE. Derivation process for the boundary filter strength in AVC and HEVC

Processing flow of DBF

• Boundary decision
– Three kinds of boundaries are involved in the filtering: CU, TU, and PU boundaries
  • CU boundaries are always involved in the filtering
  • TU boundaries on the 8×8 block grid and PU boundaries between PUs inside a CU are involved in the filtering
  • [Exception] A PU boundary that lies inside a TU shall not be filtered
• Bs calculation
– Bs is calculated on a 4×4 block basis and then re-mapped to the 8×8 grid
– Two Bs values belong to the 8 pixels that make up one line in the 4×4 grid; the maximum of the two is selected as the Bs for the boundary in the 8×8 grid

FIGURE. Overall processing flow of deblocking filter process: boundary decision, Bs calculation (4×4 -> 8×8), β and tc decision, filter on/off decision, strong/weak filter selection, strong filtering or weak filtering

• Filter on/off decision for 4 lines
– If dp0+dq0+dp3+dq3 < β, filtering for the first 4 lines is turned on
– Derive values for the weak filtering process
– For the second 4 lines, the decision is made in the same fashion as above


• β and tc decision
– Threshold values β and tc are derived based on the luma QP values QPP and QPQ
– Q = ((QPP + QPQ + 1)>>1)
– β and tc are specified in the table below with Q as input (if Bs > 1, tc is looked up with clip3(0, 55, Q+2) as input)

Q 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

β 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 7 8

tc 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Q 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37

β 9 10 11 12 13 14 15 16 17 18 20 22 24 26 28 30 32 34 36

tc 1 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4

Q 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55

β 38 40 42 44 46 48 50 52 54 56 58 60 62 64 64 64 64 64

tc 5 5 6 6 7 8 9 9 10 10 11 11 12 12 13 13 14 14

p30 p20 p10 p00 q00 q10 q20 q30

p31 p21 p11 p01 q01 q11 q21 q31

p32 p22 p12 p02 q02 q12 q22 q32

p33 p23 p13 p03 q03 q13 q23 q33

p34 p24 p14 p04 q04 q14 q24 q34

p35 p25 p15 p05 q05 q15 q25 q35

p36 p26 p16 p06 q06 q16 q26 q36

p37 p27 p17 p07 q07 q17 q27 q37

first 4 lines

second 4 lines

dEp1 = dp0+dp3 < (β+(β>>1))>>3 ? 1 : 0

dEq1 = dq0+dq3 < (β+(β>>1))>>3 ? 1 : 0

dp0 = | p2,0 – 2*p1,0 + p0,0 |

dp3 = | p2,3 – 2*p1,3 + p0,3 |

dp4 = | p2,4 – 2*p1,4 + p0,4 |

dp7 = | p2,7 – 2*p1,7 + p0,7 |

dq0 = | q2,0 – 2*q1,0 + q0,0 |

dq3 = | q2,3 – 2*q1,3 + q0,3 |

dq4 = | q2,4 – 2*q1,4 + q0,4 |

dq7 = | q2,7 – 2*q1,7 + q0,7 |

p3 p2 p1 p0 q0 q1 q2 q3


• Strong/weak filter selection for 4 lines
– If the following 2 conditions are met, the strong filter is used; otherwise, the weak filter is used
– Conditions for the first 4 lines (dp0, dq0, dp3, dq3 as defined above)
  1) 2*(dp0+dq0) < ( β >> 2 ), | p3,0 – p0,0 | + | q0,0 – q3,0 | < ( β >> 3 ) and | p0,0 – q0,0 | < ( 5*tC + 1 ) >> 1
  2) 2*(dp3+dq3) < ( β >> 2 ), | p3,3 – p0,3 | + | q0,3 – q3,3 | < ( β >> 3 ) and | p0,3 – q0,3 | < ( 5*tC + 1 ) >> 1
– Conditions for the second 4 lines: the same tests on lines 4 and 7

• Strong/weak filtering

Strong filtering
  p0’ = ( p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4 ) >> 3
  q0’ = ( p1 + 2*p0 + 2*q0 + 2*q1 + q2 + 4 ) >> 3
  p1’ = ( p2 + p1 + p0 + q0 + 2 ) >> 2
  q1’ = ( p0 + q0 + q1 + q2 + 2 ) >> 2
  p2’ = ( 2*p3 + 3*p2 + p1 + p0 + q0 + 4 ) >> 3
  q2’ = ( p0 + q0 + q1 + 3*q2 + 2*q3 + 4 ) >> 3

Weak filtering
  Δ = ( 9 * ( q0 – p0 ) – 3 * ( q1 – p1 ) + 8 ) >> 4
  When abs(Δ) is less than tC*10:
    Δ = Clip3( -tC, tC, Δ )
    p0’ = Clip1Y( p0 + Δ )
    q0’ = Clip1Y( q0 - Δ )
    If dEp1 is equal to 1:
      Δp = Clip3( -( tC >> 1 ), tC >> 1, ( ( ( p2 + p0 + 1 ) >> 1 ) – p1 + Δ ) >> 1 )
      p1’ = Clip1Y( p1 + Δp )
    If dEq1 is equal to 1:
      Δq = Clip3( -( tC >> 1 ), tC >> 1, ( ( ( q2 + q0 + 1 ) >> 1 ) – q1 – Δ ) >> 1 )
      q1’ = Clip1Y( q1 + Δq )
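The strong and weak filters can be sketched for one line of samples across an edge. This is a minimal illustration of the formulas on this slide for 8-bit luma; the on/off and strong/weak decisions are assumed to have been made by the caller, and the function names are hypothetical.

```python
def clip3(lo, hi, v):
    return max(lo, min(hi, v))


def clip1y(v, bit_depth=8):
    return clip3(0, (1 << bit_depth) - 1, v)


def strong_filter(p, q):
    """p = [p0, p1, p2, p3], q = [q0, q1, q2, q3]; returns the filtered lists."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    new_p = [(p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3,
             (p2 + p1 + p0 + q0 + 2) >> 2,
             (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3,
             p3]
    new_q = [(p1 + 2 * p0 + 2 * q0 + 2 * q1 + q2 + 4) >> 3,
             (p0 + q0 + q1 + q2 + 2) >> 2,
             (p0 + q0 + q1 + 3 * q2 + 2 * q3 + 4) >> 3,
             q3]
    return new_p, new_q


def weak_filter(p, q, tc, dEp1, dEq1):
    """Weak filtering of one line; dEp1/dEq1 enable the p1/q1 modification."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    delta = (9 * (q0 - p0) - 3 * (q1 - p1) + 8) >> 4
    if abs(delta) >= tc * 10:            # large step: treated as a real edge, no filtering
        return list(p), list(q)
    delta = clip3(-tc, tc, delta)
    new_p = [clip1y(p0 + delta), p1, p2, p3]
    new_q = [clip1y(q0 - delta), q1, q2, q3]
    if dEp1:
        dp = clip3(-(tc >> 1), tc >> 1, (((p2 + p0 + 1) >> 1) - p1 + delta) >> 1)
        new_p[1] = clip1y(p1 + dp)
    if dEq1:
        dq = clip3(-(tc >> 1), tc >> 1, (((q2 + q0 + 1) >> 1) - q1 - delta) >> 1)
        new_q[1] = clip1y(q1 + dq)
    return new_p, new_q


if __name__ == "__main__":
    print(strong_filter([10, 10, 10, 10], [20, 20, 20, 20]))
    print(weak_filter([10, 10, 10, 10], [20, 20, 20, 20], tc=2, dEp1=1, dEq1=1))
```

On a step edge from 10 to 20, the strong filter spreads the transition over three samples on each side, while the weak filter changes at most two samples per side and by no more than tC.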

Overview of sample adaptive offset (1/2)

• Artifacts
– Blocking artifacts, ringing artifacts, color biases, and blurring artifacts
– A larger transform could introduce more artifacts
  • HEVC : 4x4 ~ 32x32 transform
  • Artifacts exist at medium and low bit rates
– A large number of interpolation taps can also lead to more serious ringing artifacts
  • HEVC : 8-tap (luma), 4-tap (chroma)
• Sample adaptive offset
– To reduce sample distortion (reconstructed pixels ↔ original pixels)
– Average 3.5% BD-rate reduction (with 1% encoding time increase, 2.5% decoding time increase)
– SAO is located after the deblocking filter (DF) and also belongs to in-loop filtering

Overview of sample adaptive offset (2/2)

• SAO features
– Each color component may have its own SAO parameters
– Two SAO types
  • Edge offset (EO; 4 EO classes)
  • Band offset (BO; 1 BO class)
– SAO merging (left CTU or above CTU)
  • SAO merge information is shared by the three color components
• SAO objective and subjective results

FIGURE. Subjective comparison: SAO enabled (QP=32) vs. SAO disabled (QP=32)

TABLE. Y BD-rate (anchor: SAO disabled; test: SAO enabled; CTU size in luma: 64x64; CTU boundary: option 1)

Class        | All intra (AI) | Random access (RA) | Low delay B (LB) | Low delay P (LP)
Class A      | -0.6%          | -2.3%              |                  |
Class B      | -0.5%          | -2.1%              | -2.0%            | -11.1%
Class C      | -0.5%          | -1.1%              | -1.8%            | -7.1%
Class D      | -0.4%          | -0.3%              | -0.7%            | -4.4%
Class E      | -0.6%          |                    | -2.3%            | -11.0%
Class F      | -1.5%          | -2.6%              | -5.7%            | -12.3%
All          | -0.7%          | -1.7%              | -2.5%            | -9.2%
Enc. Time(%) | 101%           | 100%               | 100%             | 100%
Dec. Time(%) | 103%           | 103%               | 102%             | 102%

Edge offset of SAO

• Four 1-D directional patterns
– horizontal, vertical, 135° diagonal, 45° diagonal
• Only one EO class can be selected for each CTB for which EO is enabled
• Each sample inside the CTB is classified into one of five categories
– One edge offset is encoded for each category (4 offsets are transmitted in the case of EO)
– No information is signalled for the classification into the five categories (encoder and decoder use the same rules)

FIGURE. Four 1-D directional patterns for EO sample classification (current sample c with its two neighbors a and b)

TABLE. Sample classification rules for edge offset

Category | Condition
1        | c < a && c < b
2        | (c < a && c == b) || (c == a && c < b)
3        | (c > a && c == b) || (c == a && c > b)
4        | c > a && c > b
0        | None of the above (SAO is not applied)

FIGURE. Pixel level over pixel index x-1, x, x+1 for categories 1 to 4: positive edge offset (categories 1 and 2), negative edge offset (categories 3 and 4)
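The classification table translates directly into code. A minimal sketch (the function name `eo_category` is hypothetical); a and b are the two neighbors of the current sample c along the selected 1-D pattern:

```python
def eo_category(a, c, b):
    """Edge-offset category of sample c with neighbors a and b (0 means no offset)."""
    if c < a and c < b:
        return 1                                   # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2                                   # concave corner
    if (c > a and c == b) or (c == a and c > b):
        return 3                                   # convex corner
    if c > a and c > b:
        return 4                                   # local maximum
    return 0                                       # flat or monotonic: SAO not applied


if __name__ == "__main__":
    row = [5, 3, 3, 6, 4, 4, 4]
    # Horizontal EO class: neighbors are the left and right samples
    print([eo_category(row[i - 1], row[i], row[i + 1]) for i in range(1, len(row) - 1)])
```

Because the decoder applies the same rules to the same reconstructed samples, only the four offsets and the EO class have to be transmitted.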

Band offset of SAO

• BO implies one offset is added to all samples of the same band – The sample value range is equally divided into 32 bands – For 8‐bit samples ranging from 0 to 255, the width of a band is 8

• Only offsets of four consecutive bands and the starting band position are signaled to the decoder

– The average difference between the original samples and reconstructed samples in a band is signaled to the decoder

– Four offsets are transmitted in the case of BO

0 max

The first band for which offset is transmitted

Four offsets are transmitted for four consecutive bands
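Band offset can be sketched as follows for 8-bit samples, where the band index is the sample value divided by 8 (32 bands). The function name is hypothetical, and clipping to the 8-bit range is an assumption of this sketch.

```python
def apply_band_offset(samples, start_band, offsets, bit_depth=8):
    """Add offsets[i] to samples in band start_band + i; other samples are unchanged."""
    shift = bit_depth - 5                 # 32 bands -> the top 5 bits select the band
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        idx = (s >> shift) - start_band
        if 0 <= idx < len(offsets):       # one of the four consecutive signalled bands
            s = max(0, min(max_val, s + offsets[idx]))
        out.append(s)
    return out


if __name__ == "__main__":
    print(apply_band_offset([0, 8, 16, 40, 255], start_band=1, offsets=[1, 2, 3, 4]))
```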

SAO syntaxes

• Merging information
– sao_merge_left_flag
– sao_merge_up_flag
– No additional data is transferred in the merge case
• sao_type_idx_X: type of SAO
– 0: Not applied
– 1: Band offset
– 2: Edge offset
• sao_offset_abs, sao_offset_sign
– Sign is only for band offset
– Signs of EO are implicitly derived
• sao_band_position (if BO)
– Start band for SAO
• sao_eo_class (if EO)
– Class of edge offset (1-D degree)

TABLE. Syntax for sample adaptive offset

sao (rx, ry) {                                         Descriptor
  …
  sao_merge_left_flag                                  ae(v)
  sao_merge_up_flag                                    ae(v)
  if(!sao_merge_up_flag && !sao_merge_left_flag) {
    …
    sao_type_idx_luma                                  ae(v)
    sao_type_idx_chroma                                ae(v)
    if( sao_type_idx != 0) {
      for(i=0; i<4; i++)
        sao_offset_abs[cIdx][rx][ry][i]                ae(v)
      if(sao_type_idx == BO) {
        for(i=0; i<4; i++) {
          if( sao_offset_abs[cIdx][rx][ry][i] != 0)
            sao_offset_sign[cIdx][rx][ry][i]           ae(v)
        }
        sao_band_position[cIdx][rx][ry]                ae(v)
      } else {
        sao_eo_class_luma                              ae(v)
      }
    }
    …
  }
}

A fast distortion estimation for SAO

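• Distortions have to be calculated many times
• Let k, s(k), and x(k) be sample positions, original samples, and pre-SAO samples, respectively, and let C be the set of sample positions that share one offset
– Distortion between original samples and pre-SAO samples

$$D_{pre} = \sum_{k \in C} \left( s(k) - x(k) \right)^2$$

– Distortion between original samples and post-SAO samples

$$D_{post} = \sum_{k \in C} \left( s(k) - (x(k) + h) \right)^2$$

– h is the offset for the sample set and N is the number of samples in the set; the delta distortion is defined as follows (N and E need to be calculated only once)

$$\Delta D = D_{post} - D_{pre} = \sum_{k \in C} \left( h^2 - 2h\,(s(k) - x(k)) \right) = N h^2 - 2hE$$

$$E = \sum_{k \in C} \left( s(k) - x(k) \right)$$

– Delta RD cost, with R the rate of the SAO parameters

$$\Delta J = \Delta D + \lambda R$$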

Offset refinement

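• The initial offset value h is E/N, with $E = \sum_{k \in C} (s(k) - x(k))$
– All the numbers between zero and the initial offset are tested in the offset refinement process

FIGURE. Offset refinement candidates: 6, 5, 4, 3, 2, 1, 0 for an initial offset of 6, and -6, -5, -4, -3, -2, -1, 0 for an initial offset of -6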
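The fast distortion estimate and the offset refinement can be sketched together. This is an illustration under assumptions: the rate model (|h| + 1 bits) is a hypothetical placeholder for the actual entropy-coder estimate, and the function names are not taken from HM.

```python
def delta_distortion(n, e, h):
    """Delta distortion N*h^2 - 2*h*E; needs only the statistics N and E."""
    return n * h * h - 2 * h * e


def refine_offset(n, e, lam, rate_fn=lambda h: abs(h) + 1):
    """Test every integer from the initial offset E/N down to 0; keep the best delta RD cost."""
    if n == 0:
        return 0
    h_init = int(round(e / n))
    step = -1 if h_init > 0 else 1
    candidates = list(range(h_init, 0, step)) + [0]
    return min(candidates, key=lambda h: delta_distortion(n, e, h) + lam * rate_fn(h))


if __name__ == "__main__":
    s = [10, 12, 11]          # original samples
    x = [8, 9, 9]             # pre-SAO samples
    n = len(s)
    e = sum(a - b for a, b in zip(s, x))
    # Brute-force check of the closed form for h = 2
    d_pre = sum((a - b) ** 2 for a, b in zip(s, x))
    d_post = sum((a - (b + 2)) ** 2 for a, b in zip(s, x))
    print(d_post - d_pre, delta_distortion(n, e, 2))
    print(refine_offset(n, e, lam=0.0), refine_offset(n, e, lam=10.0))
```

Since N and E are collected once per band or category, every candidate offset costs only a few arithmetic operations instead of a pass over the samples.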

Encoding flow of SAO in HM

FIGURE. Encoding flow of SAO in HM

Frame-based processing
– Encoding picture
– Deblocking filtering
– SAO (frame-based)

CTB-based processing
– BO: compute the sum of differences (E) and the pixel count (N) for each of the 32 bands
– EO class 0: compute the sum of differences (E) and the pixel count (N) for each category
– EO class 0: compute the RD cost, rdcost0 = distortion + λ × rate (fast distortion estimation, offset refinement)
– BO: decide the starting band position (fast distortion estimation, offset refinement)
– BO: compute the RD cost, rdcostBO = distortion + λ × rate
– Select the type with the smallest RD cost (BO, EO class0, EO class1, EO class2, EO class3)
– RD-cost competition against left merge and up merge

Slice‐level on/off control of SAO

• Hierarchical quantization parameter (QP) settings for each group of pictures
• A slice-level on/off decision algorithm
– For a “depth=0” picture, SAO is always enabled in the slice header
– Other depths
  • If the previous picture (the last picture of depth “N-1” in decoding order) disables SAO for more than 75% of CTBs, the current picture terminates the SAO encoding process early and disables SAO in all slice headers

FIGURE. Hierarchical GOP of 8 pictures: 8k at depth 0; 8k+4 at depth 1; 8k+2 and 8k+6 at depth 2; 8k+1, 8k+3, 8k+5, and 8k+7 at depth 3; a deeper picture uses a higher QP

CTB‐based encoding issues about SAO

• Since SAO is applied after the deblocking filter (DF), the SAO parameters cannot be precisely estimated until the deblocked samples are available

– In a CTU‐based encoder, the deblocked samples of the right columns and the bottom rows in the current CTB may be unavailable

• Two practical CTU‐based SAO decisions – Case 1. Avoid using the bottom rows and right columns (current HM) – Case 2. Use non‐deblock‐filtered pixels for the bottom rows and right columns (JCTVC‐J0139)

FIGURE. Deblock‐filtered pixels and non‐deblock‐filtered pixels in the current CTU

TABLE. Average BD‐rates of enabling SAO versus disabling SAO for different CTU sizes

Option 1: Skip right and bottom samples in the CTU during parameter estimation
Option 2: Use pre‐deblocked samples near right and bottom boundaries in the CTU during parameter estimation

CTU size in luma | Option 1 Y | Option 1 Cb | Option 1 Cr | Option 2 Y | Option 2 Cb | Option 2 Cr
64×64 | ‐3.5% | ‐4.8% | ‐5.8% | ‐3.3% | ‐5.3% | ‐6.6%
32×32 | ‐2.0% | ‐1.1% | ‐1.5% | ‐2.5% | ‐2.0% | ‐2.7%
16×16 | 0.0% | ‐0.3% | 0.3% | ‐0.8% | 0.4% | 0.1%

COMPLEXITY ANALYSIS OF HEVC ENCODER

Complexity analysis of HM encoder

• Test sequences – Sequence : Class B (1920×1080), Class C (832×480)

• Class B : Kimono, ParkScene, Cactus, BasketballDrive, BQTerrace

• Class C : BasketballDrill, BQMall, PartyScene, RaceHorses

– QP : 22, 27, 32, 37 – Main profile – Random access, low delay

• Test environment – HM 7.0 software – Intel® Core™ i7 CPU 860 @ 2.8GHz – 4GB memory – Windows 7 (64‐bit) – Analysis tool : Intel® VTune™ Amplifier XE

FIGURE. ClassB ‐ BasketballDrive

FIGURE. ClassC ‐ BQMall

Profiling result of HEVC encoder

TABLE. Complexity ratio of HM 7.0 encoder (RA), in %

Class | Module | QP 22 | QP 27 | QP 32 | QP 37
B | Entropy | 6.6 | 3.4 | 1.0 | 0.9
B | Intra | 3.3 | 2.2 | 2.1 | 1.4
B | Inter | 68.4 | 78.1 | 83.9 | 85.7
B | TR+Q | 20.4 | 15.2 | 11.7 | 10.6
B | Loop filter | 0.2 | 0.2 | 0.2 | 0.1
B | etc | 1.2 | 1.1 | 1.3 | 1.5
C | Entropy | 6.5 | 3.9 | 2.8 | 1.3
C | Intra | 2.9 | 2.7 | 2.2 | 1.8
C | Inter | 68.8 | 74.9 | 79.8 | 83.3
C | TR+Q | 20.7 | 17.0 | 13.9 | 12.4
C | Loop filter | 0.2 | 0.2 | 0.2 | 0.1
C | etc | 1.0 | 1.5 | 1.4 | 1.2

TABLE. Complexity ratio of HM 7.0 encoder (LD), in %

Class | Module | QP 22 | QP 27 | QP 32 | QP 37
B | Entropy | 6.1 | 2.8 | 0.4 | 0.3
B | Intra | 3.4 | 2.0 | 1.2 | 1.2
B | Inter | 71.3 | 81.2 | 87.3 | 89.1
B | TR+Q | 18.6 | 13.0 | 9.9 | 8.5
B | Loop filter | 0.2 | 0.2 | 0.2 | 0.1
B | etc | 0.8 | 1.2 | 0.8 | 0.9
C | Entropy | 5.3 | 3.1 | 1.1 | 0.4
C | Intra | 3.0 | 2.5 | 1.8 | 1.5
C | Inter | 72.6 | 79.1 | 83.5 | 87.2
C | TR+Q | 18.2 | 14.9 | 12.1 | 10.1
C | Loop filter | 0.2 | 0.2 | 0.2 | 0.1
C | etc | 1.1 | 0.6 | 1.6 | 1.0

Complexity portions of HM encoder

Loop filter : 0.1‐0.2%

Inter prediction : 77‐81%

Intra prediction : 1‐2%

Entropy coding : 2‐4%

Tr + Q : 14‐16%

FIGURE. HEVC encoder block diagram and profiling result (legend: inter prediction, transform + Q, intra prediction, loop filter, entropy coding, etc)

Complexity ratio of CU size and mode


FIGURE. Example partitioning of a CTU into 32×32, 16×16, and 8×8 CUs

TABLE. Complexity ratio of CU size and mode

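Size | Mode | RA (%) | LD (%) | Average (%)
64x64 | Intra | 2.1 | 1.0 | 1.6
64x64 | Inter | 19.0 | 31.9 | 25.5
64x64 | Skip | 3.9 | 3.4 | 3.7
32x32 | Intra | 1.9 | 0.7 | 1.3
32x32 | Inter | 25.0 | 27.4 | 26.2
32x32 | Skip | 4.5 | 3.2 | 3.9
16x16 | Intra | 2.3 | 0.2 | 1.3
16x16 | Inter | 17.0 | 12.5 | 14.8
16x16 | Skip | 3.2 | 1.7 | 2.5
8x8 | Intra | 2.4 | 0.4 | 1.4
8x8 | Inter | 8.7 | 4.9 | 6.8
8x8 | Skip | 1.7 | 0.6 | 1.2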

Selected ratio of CU, PU and TU

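TABLE. Selected ratio of CU size and PU mode (%), by class and QP

CU size | PU mode | B QP 22 | B QP 27 | B QP 32 | B QP 37 | C QP 22 | C QP 27 | C QP 32 | C QP 37
64x64 | Merge skip | 10.6 | 26.6 | 43.3 | 55.2 | 11.7 | 20.6 | 30.6 | 39.5
64x64 | Inter 2Nx2N | 4.5 | 7.1 | 7.2 | 6.0 | 5.8 | 7.5 | 6.7 | 5.5
64x64 | Inter Nx2N | 1.4 | 2.2 | 1.8 | 1.3 | 1.6 | 1.8 | 1.7 | 1.7
64x64 | Inter 2NxN | 1.5 | 1.9 | 1.3 | 0.9 | 1.2 | 1.0 | 0.8 | 0.7
64x64 | Inter AMP | 1.2 | 1.4 | 1.0 | 0.7 | 1.0 | 1.1 | 1.0 | 1.1
64x64 | Intra 2Nx2N | 0.3 | 0.4 | 0.6 | 1.0 | 0.0 | 0.0 | 0.0 | 0.1
32x32 | Merge skip | 9.9 | 12.4 | 19.9 | 8.4 | 12.2 | 13.5 | 15.2 | 16.8
32x32 | Inter 2Nx2N | 8.1 | 6.9 | 4.6 | 3.1 | 9.1 | 7.2 | 5.4 | 4.3
32x32 | Inter Nx2N | 1.8 | 1.4 | 0.9 | 0.4 | 2.2 | 1.9 | 1.9 | 1.7
32x32 | Inter 2NxN | 1.7 | 1.3 | 0.7 | 1.0 | 1.4 | 1.0 | 0.9 | 0.8
32x32 | Inter AMP | 4.4 | 2.9 | 1.6 | 0.6 | 4.2 | 3.5 | 3.1 | 2.6
32x32 | Intra 2Nx2N | 2.3 | 2.3 | 2.6 | 2.6 | 0.2 | 0.4 | 0.7 | 1.1
16x16 | Merge skip | 6.8 | 5.6 | 3.9 | 2.9 | 8.0 | 7.7 | 7.3 | 6.1
16x16 | Inter 2Nx2N | 9.1 | 3.7 | 1.7 | 0.8 | 6.9 | 4.8 | 3.1 | 2.0
16x16 | Inter Nx2N | 1.6 | 0.7 | 0.3 | 0.1 | 2.0 | 1.4 | 1.0 | 0.6
16x16 | Inter 2NxN | 1.7 | 0.6 | 0.2 | 0.1 | 1.2 | 0.8 | 0.5 | 0.3
16x16 | Inter AMP | 4.1 | 1.4 | 0.5 | 0.2 | 4.1 | 2.7 | 1.7 | 0.9
16x16 | Intra 2Nx2N | 2.6 | 2.1 | 1.7 | 1.4 | 1.2 | 1.6 | 1.8 | 1.7
8x8 | Merge skip | 2.8 | 1.9 | 1.2 | 0.9 | 3.9 | 3.3 | 2.3 | 1.4
8x8 | Inter 2Nx2N | 5.8 | 1.3 | 0.4 | 0.1 | 4.9 | 2.5 | 1.1 | 0.4
8x8 | Inter Nx2N | 0.3 | 0.2 | 0.1 | 0.0 | 1.2 | 0.7 | 0.3 | 0.1
8x8 | Inter 2NxN | 0.4 | 0.2 | 0.1 | 0.0 | 0.7 | 0.4 | 0.2 | 0.1
8x8 | Intra 2Nx2N | 2.9 | 1.2 | 0.1 | 0.5 | 2.1 | 1.7 | 1.2 | 0.8
8x8 | Intra NxN | 0.8 | 0.6 | 0.7 | 0.2 | 1.9 | 1.1 | 0.6 | 0.3

TABLE. Selected ratio of TU (%), by class and QP

Class | TU size | QP 22 | QP 27 | QP 32 | QP 37
B | 32x32 | 33.5 | 55.0 | 63.0 | 65.7
B | 16x16 | 19.8 | 20.9 | 20.1 | 19.7
B | 8x8 | 36.2 | 15.5 | 10.7 | 10.0
B | 4x4 | 10.5 | 8.5 | 6.2 | 4.5
C | 32x32 | 35.7 | 43.4 | 49.2 | 52.2
C | 16x16 | 27.7 | 27.7 | 27.5 | 29.0
C | 8x8 | 21.7 | 18.1 | 15.8 | 13.9
C | 4x4 | 14.8 | 10.8 | 7.5 | 4.9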

BD‐BR vs. Encoding time depending on CTU size

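• CTU size : 32x32 (relative to the 64x64 reference) – BD‐bitrate increase of 3.3‐3.4 % – Encoding time reduced to 78‐79 %

• CTU size : 16x16 (relative to the 64x64 reference) – BD‐bitrate increase of 15.4‐17.5 % – Encoding time reduced to 50‐54 %

FIGURE. BD‐bitrate vs. encoding time for CTU sizes from 64x64 (reference) down to 16x16 (CTU size 16x16: Enc T 50.8 %, BD‐bitrate 17.53 %). SW : HM 7.1, Seq : Class B, cfg : Random access & low delay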

BD‐BR vs. Encoding time depending on TU size

• Transform size – Transform sizes 16×16 to 4×4 enabled

• 3.2‐3.5 % BD‐bitrate increase • 96 % encoding time

– Transform sizes 8×8 to 4×4 enabled • 10.2‐11.2 % BD‐bitrate increase • 91‐92 % encoding time

FIGURE. BD‐bitrate vs. encoding time for maximum TU sizes (Max TU size 8x8, quad‐tree max depth 1: Enc T 92.4 %, BD‐bitrate 11.2 %; reference: Max TU size 32x32, quad‐tree max depth 3). SW : HM 7.1, Seq : Class B, cfg : Random access & low delay

Tool on/off test

Fast encoding algorithms in HM software

TABLE. Fast encoding algorithms in HM software

Algorithm | Contents | Note
Fast Encoding Setting : FEN, JCTVC‐A0124 | Early CU termination; sub‐sampled SAD operation; simple bi‐prediction (number of iterations 4 ‐> 1) |
Fast Decision for Merge RD Cost : FDM, JCTVC‐H178 | Early termination using the CBF of the 2Nx2N merge candidate | PU level
Rough Mode Decision (for Intra) : RMD, JCTVC‐C311/D283 | Select RD candidate modes among the 35 intra modes based on SATD; choose the best mode among the candidates based on RD cost; apply the full RQT to the best mode | PU level
AMP Speed‐up : AMPS, JCTVC‐E316 | Apply ME or merge to AMP selectively under specific conditions | PU level
CBF Fast Mode Setting : CFM, JCTVC‐F045 | If the CBF of the current PU is 0, skip the ME process for the subsequent PUs | PU level
Early CU Setting : ECU, JCTVC‐F092 | If the best mode of the current CU is skip, skip the split into sub‐CUs | CU level
Early Skip Detection Setting : ESD, JCTVC‐G543 | Conditional early skip detection after the Inter 2Nx2N computation | CU level
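The Early CU (ECU) row can be illustrated with a small recursive sketch: when the best mode of a CU is skip, its sub‐CUs are never evaluated. The `evaluate` callback, the cost values, and the leaf format are hypothetical stand‐ins for the encoder's mode decision, not HM code.

```python
def encode_cu(x, y, size, evaluate, min_size=8, early_cu=True):
    """Recursive CU decision with optional Early CU termination.

    evaluate(x, y, size) -> (best_mode, rd_cost) is a hypothetical callback.
    Returns (rd_cost, list of chosen leaf CUs).
    """
    mode, cost = evaluate(x, y, size)
    if size == min_size or (early_cu and mode == "skip"):
        return cost, [(x, y, size, mode)]
    half = size // 2
    split_cost, split_leaves = 0.0, []
    for dy in (0, half):
        for dx in (0, half):
            c, leaves = encode_cu(x + dx, y + dy, half,
                                  evaluate, min_size, early_cu)
            split_cost += c
            split_leaves += leaves
    if split_cost < cost:
        return split_cost, split_leaves
    return cost, [(x, y, size, mode)]


calls = []


def toy_evaluate(x, y, size):
    """Toy evaluator: every CU is best coded as skip (illustration only)."""
    calls.append((x, y, size))
    return "skip", float(size * size)


calls.clear()
encode_cu(0, 0, 64, toy_evaluate, early_cu=True)
calls_with_ecu = len(calls)

calls.clear()
encode_cu(0, 0, 64, toy_evaluate, early_cu=False)
calls_without_ecu = len(calls)
```

In this extreme toy case the full search evaluates 1 + 4 + 16 + 64 CUs, while ECU stops after the first one; real content falls somewhere in between.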

PARALLEL TOOLS IN HEVC

Tile (JCTVC‐E408, JCTVC‐F335)

• Motivation and goal of tile – Picture partitioning introduces coding loss – Why partition a picture?

• High‐level parallel processing • Maximum transmission unit (MTU of the network) size matching • Motion estimation with constrained on‐chip memory

– Goal: reduce coding loss due to partitioning

• Vertical and horizontal boundaries partition the picture into tiles • Boundary locations may be specified individually or spaced uniformly • Tiles are always rectangular and contain an integer number of CTUs

FIGURE. Picture partitioned into four tiles (Tile#1 to Tile#4), each assigned to its own core (Core 1 to Core 4)

Tile & slice

• Tiles are always rectangular and always contain an integer number of coding tree blocks in coding tree block raster scan

• A parallel implementation based on tiles achieves higher compression efficiency than one based on slices.

FIGURE. Four‐core partitioning of a picture in tile mode (Tile#1 to Tile#4) and in slice mode (slice1 to slice4)

TABLE. Slice mode and tile mode vs. anchor (BD‐rate, %)

Seq. | Slice mode BDR | Slice mode BDR‐low | Slice mode BDR‐high | Tile mode BDR | Tile mode BDR‐low | Tile mode BDR‐high
Class D | 3.9 | 6.1 | 2.4 | 2.7 | 3.6 | 2.0
Class C | 4.5 | 6.6 | 2.8 | 3.3 | 4.4 | 2.4
Class B | 5.3 | 7.6 | 3.5 | 4.0 | 5.1 | 3.2
Class E | 13.4 | 21.6 | 6.1 | 6.3 | 9.1 | 3.6
Avg. | 6.2 | 9.6 | 3.6 | 3.9 | 5.3 | 2.8

TABLE. Slice mode vs. tile mode (BD‐rate, %)

Seq. | BDR | BDR‐low | BDR‐high
Class D | ‐1.2 | ‐2.3 | ‐0.4
Class C | ‐1.1 | ‐2.1 | ‐0.4
Class B | ‐1.2 | ‐2.3 | ‐0.4
Class E | ‐6.3 | ‐10.3 | ‐2.4
Avg. | ‐2.1 | ‐3.8 | ‐0.8

Relationship between slices and tiles

FIGURE. Two examples of slice and tile layouts: Example 1 with Slice#0 and Slice#1 over two tiles (Tile #0, Tile #1), and Example 2 with Slice#0 and Slice#1 over four tiles (Tile #0 to Tile #3)

• Constraints are set on the relationship between slices and tiles; at least one of the following must hold. – All CTBs in a slice belong to the same tile. – All CTBs in a tile belong to the same slice.
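One way to read the constraint is pairwise: whenever a slice and a tile share CTBs, either the slice lies entirely inside the tile or the tile lies entirely inside the slice. The sketch below checks that reading on per‐CTB slice and tile ids; the function name and the input format are assumptions made for illustration.

```python
def slices_tiles_consistent(ctb_slice, ctb_tile):
    """Check the slice/tile constraint on per-CTB ids.

    ctb_slice[i] and ctb_tile[i] are the slice id and tile id of CTB i.
    """
    tiles_of_slice, slices_of_tile = {}, {}
    for s, t in zip(ctb_slice, ctb_tile):
        tiles_of_slice.setdefault(s, set()).add(t)
        slices_of_tile.setdefault(t, set()).add(s)
    for s, t in set(zip(ctb_slice, ctb_tile)):
        slice_in_tile = tiles_of_slice[s] == {t}
        tile_in_slice = slices_of_tile[t] == {s}
        if not (slice_in_tile or tile_in_slice):
            return False
    return True


# 4x2 CTBs in raster order, two tiles side by side (columns 0-1 and 2-3)
tiles = [0, 0, 1, 1,
         0, 0, 1, 1]
one_slice_per_tile = [0, 0, 1, 1,
                      0, 0, 1, 1]
one_slice_for_all = [0] * 8
row_slices = [0, 0, 0, 0,
              1, 1, 1, 1]  # each slice cuts through both tiles
```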

Syntax for tile

• Syntax elements for tile

TABLE. Syntax for tile in PPS

pic_parameter_set_rbsp( ) {                                  Descriptor
  …
  tiles_enabled_flag                                         u(1)
  …
  if( tiles_enabled_flag ) {
    num_tile_columns_minus1                                  ue(v)
    num_tile_rows_minus1                                     ue(v)
    uniform_spacing_flag                                     u(1)
    if( !uniform_spacing_flag ) {
      for( i = 0; i < num_tile_columns_minus1; i++ )
        column_width_minus1[ i ]                             ue(v)
      for( i = 0; i < num_tile_rows_minus1; i++ )
        row_height_minus1[ i ]                               ue(v)
    }
    loop_filter_across_tiles_enabled_flag                    u(1)
  }
  …
}

FIGURE. Picture with six tiles (Tile#0 to Tile#5), annotated with num_tile_columns_minus1, num_tile_rows_minus1, column_width_minus1[0], and row_height_minus1[2]

• uniform_spacing_flag equal to 1 specifies that column boundaries and likewise row boundaries are distributed uniformly across the picture.
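A short sketch of how tile column widths (or row heights) in CTBs could be derived from these syntax elements. The uniform case uses integer division of the picture size, and the non‐uniform case infers the last tile from the remainder; the function name and argument layout are illustrative.

```python
def tile_sizes(pic_size_in_ctbs, num_tiles_minus1,
               uniform_spacing_flag=True, explicit_minus1=None):
    """Tile widths (or heights) in CTBs along one picture dimension.

    explicit_minus1 holds the signalled column_width_minus1[] or
    row_height_minus1[] values (num_tiles_minus1 entries).
    """
    n = num_tiles_minus1 + 1
    if uniform_spacing_flag:
        return [((i + 1) * pic_size_in_ctbs) // n
                - (i * pic_size_in_ctbs) // n
                for i in range(n)]
    sizes = [v + 1 for v in explicit_minus1]
    sizes.append(pic_size_in_ctbs - sum(sizes))  # last tile takes the rest
    return sizes


# 1920x1080 with 64x64 CTUs: 30 CTB columns, 17 CTB rows
uniform_cols = tile_sizes(30, 3)   # four tile columns
uniform_rows = tile_sizes(17, 1)   # two tile rows
explicit_cols = tile_sizes(30, 2, uniform_spacing_flag=False,
                           explicit_minus1=[9, 11])
```

Uniform spacing does not require the picture size to be divisible by the tile count; the remainder is spread across tiles, as the 7/8/7/8 column split shows.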

Wave‐front parallel processing (WPP)

• Method to perform high‐level parallel video encoding and decoding

• Advantages – The coding loss caused by parallelization is relatively small compared to other parallelization methods (WPP vs. slice or tile : 0.8~1% gain)

– Fast decoding and very low delay with the dependent slice

• For large sequences (Classes A, B, E) the acceleration factor is 1.8x with two threads and 3x with four threads

• Disadvantages – Because of the parallelization phases (prologue, kernel, epilogue), the number of active cores is limited

FIGURE. Number of active cores over time during the prologue, kernel, and epilogue phases, with the average number of active cores marked

FIGURE. Single‐thread processing of HEVC encoding/decoding

FIGURE. Parallel processing without breaking pixel/MV dependencies (ideal case)

Wave‐front parallel processing (WPP)

• In HEVC, CABAC is the only entropy coder – Probabilities are not available at the first CTU of each line – Re‐initializing the probabilities at the beginning of each CTU line degrades coding performance

• Synchronize CABAC probabilities with the upper‐right CTU (JCTVC‐E196, JCTVC‐F274)

– The upper‐right CTU is already available because of the spatial/MV dependency – Efficiently carries the quickly learned probabilities over in the vertical direction

FIGURE. Wavefront processing of HEVC CTUs (problem): probabilities are not available at the start of each CTU line

FIGURE. Wavefront processing of HEVC CTUs (WPP)
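The prologue, kernel, and epilogue phases can be reproduced with a toy schedule. This sketch assumes every CTU takes one time unit, one core per CTU row, and that a CTU may start once its left neighbour and the upper‐right CTU have finished; these are simplifying assumptions, not measurements.

```python
def wpp_schedule(rows, cols):
    """Earliest start time of each CTU under the WPP dependencies."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            t = 0
            if c > 0:
                t = max(t, start[r][c - 1] + 1)         # left neighbour
            if r > 0:
                up_right = min(c + 1, cols - 1)          # clamp at picture edge
                t = max(t, start[r - 1][up_right] + 1)   # upper-right CTU
            start[r][c] = t
    return start


def active_cores(start):
    """Number of CTUs being processed at each time step."""
    total_time = max(max(row) for row in start) + 1
    counts = [0] * total_time
    for row in start:
        for t in row:
            counts[t] += 1
    return counts


start = wpp_schedule(rows=4, cols=8)
counts = active_cores(start)
average_active = sum(counts) / len(counts)
```

For this 4×8 CTU picture the core count ramps up and back down in steps, so the average number of active cores (about 2.3) stays well below the four available rows.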

Bitstream structure for WPP & Tiles

• The bitstream consists of sub‐bitstreams (subsets)

FIGURE. Bitstream layout: SPS, PPS, slice header, slice data; the slice data in the slice layer RBSP is divided into Subset 0, Subset 1, Subset 2

Example values: num_entry_point_offsets = 3; offset_len_minus1 (range 0 to 31) = 8, so each entry_point_offset[ i ] is represented by 9 bits

TABLE. Syntax for subsets in slice header

slice_header( ) {                                            Descriptor
  if( tiles_enabled_flag | | entropy_coding_sync_enabled_flag ) {
    num_entry_point_offsets                                  ue(v)
    if( num_entry_point_offsets > 0 ) {
      offset_len_minus1                                      ue(v)
      for( i = 0; i < num_entry_point_offsets; i++ )
        entry_point_offset[ i ]                              u(v)
    }
  }
}

Only one of the WPP and tile tools can be used in the Main profile.
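A sketch of how a decoder might use the entry points to hand each subset to its own thread. It assumes that entry_point_offset[ i ] gives the size in bytes of subset i and that any remaining bytes form one final subset; this interpretation and the function names are assumptions for illustration.

```python
def entry_point_bits(offset_len_minus1):
    """Number of bits used to code each entry_point_offset[i]."""
    return offset_len_minus1 + 1


def split_subsets(slice_data, entry_point_offsets):
    """Split slice data into subsets using the signalled offsets.

    Assumes each offset is the byte length of one subset and the
    remaining bytes form the last subset.
    """
    if sum(entry_point_offsets) > len(slice_data):
        raise ValueError("entry point offsets exceed slice data length")
    subsets, pos = [], 0
    for size in entry_point_offsets:
        subsets.append(slice_data[pos:pos + size])
        pos += size
    subsets.append(slice_data[pos:])
    return subsets


slice_data = bytes(range(10))                 # dummy payload
subsets = split_subsets(slice_data, [3, 4, 2])
subset_lengths = [len(s) for s in subsets]
```

Since the offsets are known from the slice header, each tile or CTU row can start entropy decoding without first parsing the preceding subsets.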

• Case 1 : multiple tiles in a slice – The slice layer RBSP bitstream consists of sub‐bitstreams for decoding – The CTU decoding order is determined by the tile partition

FIGURE. Example: Slice#0 contains Tile #0 and Tile #1, and Slice#1 contains Tile #2 and Tile #3; in the bitstream, Slice 0 carries the slice data of Tile 0 and Tile 1, followed by Slice 1 (slice header + slice data of Tile 2 and Tile 3)
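How the tile partition determines the CTU decoding order can be sketched as follows: tiles are visited in raster order, and CTBs are visited in raster order inside each tile. The function name is illustrative; the inputs are the tile column widths and row heights in CTBs.

```python
def ctb_order_in_tiles(col_widths, row_heights):
    """Raster-scan CTB addresses listed in tile-by-tile decoding order."""
    pic_width = sum(col_widths)
    order = []
    y0 = 0
    for h in row_heights:
        x0 = 0
        for w in col_widths:
            # Raster scan inside the current tile
            for y in range(y0, y0 + h):
                for x in range(x0, x0 + w):
                    order.append(y * pic_width + x)
            x0 += w
        y0 += h
    return order


# 4x2 CTBs split into two tile columns of width 2
order = ctb_order_in_tiles(col_widths=[2, 2], row_heights=[2])
```

The left tile (addresses 0, 1, 4, 5) is decoded completely before the right tile (2, 3, 6, 7), which is why the order differs from a plain picture raster scan.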

• Case 2 : multiple slices in a tile – According to the tile partition, CTU decoding order can be decided – Slice data includes start CTU address information



FIGURE. Example: Tile #0 contains Slice#0 and Slice#1, and Tile #1 contains Slice#2; the bitstream consists of Slice 0 (slice data of Tile 0), Slice 1 (slice header + slice data of Tile 0), and Slice 2 (slice header + slice data of Tile 1)

Conventional Encoding procedure

• Conventional encoding process – Pixel encoding can be processed as a 2D wave‐front – However, entropy encoding is a sequential process

• A large amount of memory is required to buffer the syntax elements • In general, the encoding process follows raster‐scan order because of the entropy coder

FIGURE. Conventional encoding: pixel encoding produces syntax elements for each CTB (CTB0, CTB1, …, CTBk‐1), and entropy encoding converts them sequentially into the bitstream

Tile encoding & decoding procedure

• Tile encoding/decoding process – Tiles have no dependency on each other for pixel coding or entropy coding – All tiles can be encoded/decoded at the same time

FIGURE. Tile encoding and decoding: pixel encoding and entropy encoding of Subset 0 to Subset 3 run in parallel, with tile boundary processing (DB, SAO), followed by bitstream transmission; the decoder runs entropy decoding and pixel decoding of Subset 0 to Subset 3 in parallel

WPP encoding & decoding procedure

• WPP encoding/decoding process – Pixel coding can be a 2D wave‐front process – Entropy coding can also be a 2D wave‐front process by using the context synchronization scheme

FIGURE. WPP encoding and decoding: pixel encoding and entropy encoding of Subset 0 to Subset 3 proceed as a wave‐front with a context sync between consecutive subsets, followed by bitstream transmission; entropy decoding and pixel decoding of Subset 0 to Subset 3 use the same sync points

Conclusion

• Overview of HEVC • Encoding parameters for HEVC test model (HM) • Complexity analysis of HEVC encoder • Fast encoding algorithms and performances • Issues of parallel processing
