Efficient Realization of Parallel HEVC Intra Coding


Yanan Zhao1, Li Song1, Xiangwen Wang2, Min Chen2, Jia Wang1

1Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China

2Shanghai University of Electric Power, Shanghai, China

July 14th, 2013

Introduction

HEVC is the newest video coding standard introduced by ITU-T VCEG and ISO/IEC MPEG.

Compared with H.264/AVC, HEVC decreases the bitrate by 50% on average while maintaining the same visual quality [1].

Fig.1. BQMall_832x480. Left: HEVC 1.5Mbps, right: x264 3.0Mbps


However, the encoding complexity of HEVC is several times higher than that of H.264.

• RDO: iterate over all mode and partition combinations to decide the best coding information

• RDOQ: iterate over many QP candidates for each block

• Intra: prediction modes increased to 35 for luma

• SAO: works pixel by pixel

• Quadtree structure: bigger block sizes and numerous partition manners

• Some other highly computational modules ...

As a result, the traditional method, which performs the encoding sequentially, can no longer meet real-time demands, especially for HD (1920x1080) and UHD (3840x2160) videos.

Parallelism in the encoding procedure must be extensively utilized.

1. HEVC Intra Coding

• Quadtree partition structure
  • Flexible partition manners fit picture characteristics better

• Three independent block concepts
  • CU - Coding Unit
  • PU - Prediction Unit
  • TU - Transform Unit

• Intra prediction modes
  • 35 modes for luma prediction
  • One special mode can also be used for chroma prediction

Fig.2. Prediction modes in HEVC intra coding

In HEVC intra coding, all the prediction modes use the same basic set of reference samples: the pixels of the bottom-left and left columns, the top and top-right rows, and the top-left corner point.

Intra coding is performed in units of CTUs (the largest CUs) in raster scan order. Within each CTU, CUs are processed in quadtree traversal order. As a consequence, two levels of CU dependency exist.

2. HEVC Intra Dependency Analysis

At the CTU level, each CTU must wait until its left and top-right neighbor CTUs finish reconstruction. This makes the current CTU row always lag two CTUs behind its adjacent upper row.

Maximum parallelism is achieved if each CTU starts encoding whenever its two dependent CTUs finish.

Encoding then proceeds like a wavefront, as shown in Fig.3.
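The wavefront timing can be checked with a minimal sketch. It assumes every CTU takes one time unit, that an unlimited number of threads is available, and that a CTU in the last column (which has no top-right neighbor) waits for its top neighbor instead. The function name wavefront_start_times is hypothetical.

```python
def wavefront_start_times(rows, cols):
    """Earliest start time of every CTU, with one time unit per CTU."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(start[r][c - 1])            # left neighbor
            if r > 0:
                # top-right neighbor; top neighbor in the last column (assumption)
                deps.append(start[r - 1][min(c + 1, cols - 1)])
            # a CTU starts one unit after its latest dependency started
            start[r][c] = max(deps) + 1 if deps else 0
    return start


times = wavefront_start_times(8, 16)
makespan = times[-1][-1] + 1   # time units needed for the whole picture
```

For a picture of 8 x 16 CTUs, each row starts two time units after the row above it, so CTU (r, c) starts at time 2r + c and the picture needs 30 time units instead of 128 in sequential order.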

2.1 CTU-level Dependency

Fig.3. Maximum parallelism at the CTU level (each block stands for one CTU; the picture is 16 CTUs wide and 8 CTU rows high, with the rows assigned to thread0 through thread7)

Intra dependency also exists within one CTU. Assume the CTU size is 32x32; it can be further split into 16x16 and 8x8 CUs and 4x4 PUs in a quadtree structure.

HEVC decides the best partition and best mode in a brute-force way: it compares the best cost of the current CU with the summed costs of its four sub-CUs. The flow chart is given in Fig.4.

For each CTU, the encoder has to perform the following amount of calculations:

• IntraPred32x32 35 times

• IntraPred16x16 4x35 times
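The counts above can be tabulated with a short sketch. It assumes the same pattern (number of blocks of a given size times 35 luma modes) continues down to 8x8 and 4x4 blocks, which the list above does not state explicitly. The function name intra_pred_calls is hypothetical.

```python
def intra_pred_calls(ctu_size=32, min_size=4, luma_modes=35):
    """Number of intra prediction calls per block size for one CTU."""
    calls = {}
    size = ctu_size
    while size >= min_size:
        blocks = (ctu_size // size) ** 2   # blocks of this size inside the CTU
        calls[size] = blocks * luma_modes
        size //= 2
    return calls


calls = intra_pred_calls()
total_calls = sum(calls.values())
```

Under this assumption a 32x32 CTU needs 35 + 140 + 560 + 2240 = 2975 luma prediction calls.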



2.2 CU-level Dependency

Fig.4. Deciding the best encoding cost in HEVC intra coding (flow chart: starting from nDepth=0, nSize=32, nMaxDepth=3, CheckRDCostIntra(nDepth,nSize) is evaluated; while nDepth<nMaxDepth the encoder descends with nDepth++ and nSize/=2; on the way back up with nDepth--, if SumChildCost<ParentCost then SplitFlagCurrParCU=true, BestModes=BestChildModes and BestCost=SumChildCost, otherwise SplitFlagCurrParCU=false, BestModes=BestParMode and BestCost=ParentCost)


Because it is inherently sequential, this process is very time-consuming, which limits the intra coding speed to a large extent.
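The brute-force decision of Fig.4 can be sketched recursively. This is an illustration only, not the x265 implementation: cost_fn is a hypothetical stand-in for the rate-distortion cost of coding a block without splitting (the role CheckRDCostIntra plays in the flow chart), and the toy cost used in the example is invented.

```python
def best_partition(x, y, size, cost_fn, min_size=4):
    """Return (best cost, partition tree) for the block at (x, y)."""
    parent_cost = cost_fn(x, y, size)
    node = {"x": x, "y": y, "size": size, "split": False}
    if size == min_size:
        return parent_cost, node
    half = size // 2
    # Visit the four sub-blocks in quadtree (z-scan) order.
    children = [best_partition(x + dx, y + dy, half, cost_fn, min_size)
                for dy in (0, half) for dx in (0, half)]
    child_cost = sum(cost for cost, _ in children)
    if child_cost < parent_cost:
        node["split"] = True
        node["children"] = [tree for _, tree in children]
        return child_cost, node
    return parent_cost, node


# Toy cost that grows faster than the block area, so splitting always wins.
cost, tree = best_partition(0, 0, 32, lambda x, y, size: size ** 3)
```

Every block size is evaluated before the comparison is made, which is why the sequential version cannot skip any of the work.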

Now we analyse the CU dependencies in terms of time iterations. We denote by CU_N(x,y) a CU of size NxN whose relative coordinate from the CTU's top-left pixel is (x,y). We also take the processing time of one CU_4x4 as one iteration. The processing times of each CU_8x8, CU_16x16 and CU_32x32 are then 4, 16 and 64 iterations respectively.

The encoder first uses 64 iterations for CU_32(0,0), then 16 more iterations for CU_16(0,0), then 4 for CU_8(0,0), so CU_4(0,0) starts at iteration 64+16+4+1=85. Calculations for the other CUs are similar. The resulting starting times of all 4x4 CUs are shown in Fig.5.(a).
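The sequential starting times can be reproduced with a small sketch, assuming a depth-first traversal in which each CU is processed before its four sub-CUs and the sub-CUs are visited in z-scan order. The function name sequential_starts is hypothetical.

```python
def sequential_starts(ctu_size=32, min_size=4):
    """Starting iteration (1-based) of every CU under sequential processing."""
    starts = {}
    clock = 1

    def visit(x, y, size):
        nonlocal clock
        starts[(size, x, y)] = clock
        clock += (size // min_size) ** 2   # one iteration per 4x4 area
        if size > min_size:
            half = size // 2
            for dy in (0, half):
                for dx in (0, half):
                    visit(x + dx, y + dy, half)

    visit(0, 0, ctu_size)
    return starts, clock - 1


starts, total_iterations = sequential_starts()
print(starts[(4, 0, 0)])   # 85
```

Keys are (N, x, y) for CU_N(x,y). The values agree with the numbers quoted above, for example 85 for CU_4(0,0) and 256 for the last CU_4x4.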

Fig.5. Starting times of each CU_4x4: (a) sequential fashion (starting times 85 to 256); (b) with maximum parallelism (starting times 1 to 37)


Notice that CU_4(0,0), CU_8(0,0) and CU_16(0,0) each need only a subset of the reference pixels of CU_32(0,0), so once CU_32(0,0) is ready for processing, so are the other three CUs.

Further, CU_4(8,0) and CU_4(0,8) are also ready for prediction once the partition and best modes of CU_16(0,0) are decided.

We use a DAG (Directed Acyclic Graph) to visualize the CU dependencies [2], as shown in Fig.6. The vertical axis is the iteration. CUs with the same vertical coordinate can be started simultaneously.

If CUs are always started as soon as they are ready (that is, parallel processing is enabled), maximum parallelism within one CTU can be achieved. The starting moments under this mechanism are shown in Fig.5.(b). It should be noted that although the last CU_4x4 finishes at iteration 37, the whole CTU ends at iteration 64, which is the finishing time of CU_32(0,0). Clearly, the processing time of all other CUs is hidden in the processing time of the largest CU.

If parallelism at this level is fully utilized, the theoretical speedup gain for one CTU can be as high as

(257-64)/64 x 100% = 301.56%


Fig.6. Part of the DAG for one CTU processing (nodes: CU_32x32, CU_16x16(0,0), the CU_8x8 and CU_4x4 blocks, and the "select" steps that join them; vertical axis: iteration)

3. Proposed Parallelization Scheme

In this section, we propose a two-stage parallelization speedup scheme exploiting CTU level parallelism.

The design of the scheme takes two major aspects into consideration:

• Maximizing encoding speedup

• Minimizing compression performance loss

The overall structure strikes a good balance between design effort, degree of parallelism and RD performance.

• The first stage performs parallel processing by launching a number of threads, with each thread processing one CTU row under the CTU-level constraint. The resulting prediction and partition information is then stored.

• In the second stage, a single thread is used to encode all the CTUs within the picture in raster scan order.


Fig.6. Different moments of the proposed parallelism scheme: (a) thread3 starts processing; (b) thread1 starts processing the first CTU in the 6th row. (Gray areas are CTUs that have finished processing, while blue areas are those that have finished entropy coding.)

This scheme achieves three benefits in terms of encoding speed and compression efficiency:

• Maximizing acceleration ratio - the most computation-intensive part is processed in parallel

• Minimizing performance loss - entropy coding is continuous over the whole picture

• Further speedup gain - stages 1 and 2 run simultaneously, so the encoding time is hidden in the processing time
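The two-stage structure can be sketched with Python threads. This is a simplified illustration under stated assumptions, not the x265 implementation: analyse_ctu and entropy_code_ctu are hypothetical callbacks standing in for stage 1 (mode and partition decision) and stage 2 (entropy coding), rows are assigned to threads round-robin, and a CTU in the last column waits for its top neighbor because it has no top-right neighbor.

```python
import threading


def encode_picture(rows, cols, num_threads, analyse_ctu, entropy_code_ctu):
    """Stage 1: parallel CTU-row analysis. Stage 2: one raster-order coder."""
    done = [0] * rows    # done[r] = CTUs of row r finished by stage 1
    stored = {}          # (row, col) -> info produced by stage 1
    bitstream = []       # stage 2 output, in raster scan order
    cond = threading.Condition()

    def stage1(tid):
        # Thread tid handles rows tid, tid + num_threads, ...
        for r in range(tid, rows, num_threads):
            for c in range(cols):
                if r > 0:
                    # Wait for the top-right CTU (top CTU in the last column).
                    needed = min(c + 2, cols)
                    with cond:
                        cond.wait_for(lambda: done[r - 1] >= needed)
                info = analyse_ctu(r, c)
                with cond:
                    stored[(r, c)] = info
                    done[r] = c + 1
                    cond.notify_all()

    def stage2():
        # Runs concurrently with stage 1, consuming CTUs in raster order.
        for r in range(rows):
            for c in range(cols):
                with cond:
                    cond.wait_for(lambda: done[r] > c)
                    info = stored[(r, c)]
                bitstream.append(entropy_code_ctu(r, c, info))

    threads = [threading.Thread(target=stage1, args=(t,))
               for t in range(num_threads)]
    threads.append(threading.Thread(target=stage2))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bitstream
```

The left-neighbor dependency needs no explicit wait because each row is processed by a single thread from left to right. A real encoder would use native threads; Python threads are used here only to show the synchronization pattern.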

4. Simulation results

We implemented our algorithm on x265, an open source HEVC encoder[3].

The encoder is configured as follows: all intra, CTU size 32, split depth 3, QP 22, 27, 32, 37, no fast split decision or fast intra mode selection algorithms. The benchmark is the default x265 encoder running in a single thread. All tests run on an HP workstation with 8 cores @ 2.6GHz.

The first group of experiments studies the relationship between speedup gain and the number of threads. The results are shown in Fig.7.

The second group tests the proposed algorithm on different video sequences. The results are shown in Table.1.


Fig.7. Speedup gains with thread numbers

Test sequence: BasketballDrill_832x480_50

Speedup gain: $\text{gain} = \dfrac{T_{\text{anchor}} - T_{\text{proposed}}}{T_{\text{proposed}}} \times 100\%$


Sequence                       QP   Anchor (s)   Proposed (s)   Gain

PeopleOnStreet_2560x1600_30    22   384.212      61.729         522%
                               27   375.194      58.203         545%
                               32   367.106      57.268         541%
                               37   366.805      57.205         541%

Traffic_2560x1600_30           22   386.036      58.391         586%
                               27   374.760      57.704         549%
                               32   370.128      57.642         542%
                               37   372.294      57.143         512%

BQTerrace_1920x1080_60         22   822.581      156.312        426%
                               27   806.491      132.708        508%
                               32   782.312      131.881        493%
                               37   757.416      131.694        475%

Cactus_1920x1080_50            22   646.017      127.295        407%
                               27   644.591      109.981        486%
                               32   638.755      110.055        480%
                               37   632.842      110.055        475%

ParkScene_1920x1080_24         22   327.787      55.972         486%
                               27   310.908      52.728         490%
                               32   310.066      52.759         488%
                               37   304.854      52.759         478%

Average                                                         502%

Table.1. Speedup gains on different video sequences
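The gain arithmetic can be sketched as follows, assuming the gain is defined as (T_anchor - T_proposed) / T_proposed x 100%. This definition reproduces the first row of Table.1 and has the same form as the theoretical per-CTU figure (257-64)/64 x 100%. The function name speedup_gain is hypothetical.

```python
def speedup_gain(t_anchor, t_proposed):
    """Speedup gain in percent: (T_anchor - T_proposed) / T_proposed * 100."""
    return (t_anchor - t_proposed) / t_proposed * 100.0


# PeopleOnStreet, QP 22 (times in seconds, taken from Table.1)
print(round(speedup_gain(384.212, 61.729)))   # 522
# Theoretical gain for one CTU with full CU-level parallelism
print(f"{speedup_gain(257, 64):.2f}")         # 301.56
```

With this definition, a gain of 500% means the proposed encoder is six times as fast as the anchor.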

Newest Work - Speedup Inter Coding

Sequence (3840x2160)   PSNR (dB)   Bitrate (Mbps)   Frame rate (fps)

Cactus                 43.266      3.994            30.16
Foreman                42.345      4.878            29.73
Coastguard             36.985      17.025           14.47
News                   44.203      2.504            33.66
Suzie                  41.640      5.157            28.32
Mobile                 39.599      5.999            26.28
Library                38.782      3.405            31.34
BundNightscape         38.548      6.190            26.85
AncientTown            39.355      5.830            27.91
Horses                 38.219      8.310            21.07
TrafficAndBuilding     37.567      5.683            25.01
Marathon               34.576      24.672           10.21

Average                39.590      7.111            25.42

Table.2. Performance on UHD sequences with intra and inter coding speedup

Table.2. shows the encoding performance when the proposed algorithm is used to speed up both intra and inter coding. Configuration: QP 32, one I frame followed by 49 P frames, 16 threads. The UHD sequences are from SJTU [4] and Elemental Technologies [5].

References

[1] G. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard", IEEE Trans. on CSVT, 2013

[2] N. Cheung, O. Au, M. Kung, "Highly parallel rate-distortion optimized intra-mode decision on multicore graphics processors", IEEE Trans. on CSVT, 2009

[3] x265 project, http://code.google.com/p/x265/

[4] http://medialab.sjtu.edu.cn//web4k/index.html

[5] http://www.elementaltechnologies.com/resources/4k-test-sequences

Questions?
