A Highly Parallel Framework for HEVC Coding Unit Partitioning Tree Decision on Many-core Processors Chenggang Yan, Yongdong Zhang, Jizheng Xu, Feng Dai, Liang Li, Qionghai Dai, and Feng Wu. IEEE SIGNAL PROCESSING LETTERS, VOL. 21, NO. 5, MAY 2014

Jan 15, 2016


Joleen Francis
Transcript
Page 1

A Highly Parallel Framework for HEVC Coding Unit Partitioning Tree Decision on Many-core Processors

Chenggang Yan, Yongdong Zhang, Jizheng Xu, Feng Dai, Liang Li, Qionghai Dai, and Feng Wu.

IEEE SIGNAL PROCESSING LETTERS, VOL. 21, NO. 5, MAY 2014

Page 2

Outline

Introduction
Related Work
Proposed Method
Experimental Results
Conclusion

Page 3

Introduction(1/3)

In HEVC, each frame is divided into non-overlapping CTUs, which can be recursively split into smaller CUs.

For a CTU, the CU partitioning tree (CUPT) controls how a CTU is coded with CUs with variable block sizes and coding modes.

The price to be paid for higher coding efficiency is higher computational complexity.

Page 4

Introduction(2/3)

To speed up the CUPT decision process, many researchers have tried to reduce the search space by not searching every branch of the quad-tree [10].

To guarantee coding efficiency, many branches of the quad-tree cannot be skipped, so the speedup is no more than two times.

Many researchers consider only RD-based intra mode selection, although inter mode selection is much more time-consuming.

• [10] L. Shen, Z. Liu, X. Zhang, et al., “An effective CU size decision method for HEVC encoders,” IEEE Trans. Multimedia, vol. 15, pp. 465–470, Jan. 2013.

Page 5

Introduction(3/3)

Many-core processors are good candidates for speeding up compression algorithms.

Efficient parallelization of CUPT decision (CUPTD) on many-core processors is challenging, because CUPTD has complicated data dependencies.

If CUPTD isn’t extensively parallelizable, cores will be left unused and performance might suffer.

Page 6

Related Work(1/3)

HEVC CU Partition Tree Decision(CUPTD)

Page 7

Related Work(2/3)

For RD-based intra prediction:

Instead of applying intra coding at the PU level, HEVC conducts intra prediction sequentially at the TU level, so that each TU always uses the nearest neighboring reference samples from already reconstructed TUs.

To enhance coding efficiency, HEVC provides as many as 35 intra prediction modes.

Just as in H.264/AVC, the left, above, and above-right neighboring reconstructed samples are used for intra prediction.

Page 8

Related Work(3/3)

For RD-based inter prediction:

The best motion vector predictor is selected from a given advanced motion vector prediction candidate list (AMVPCL).

The AMVPCL is composed of both spatial candidates and temporal candidates.

Spatial candidates need the motion information of the neighboring left, left-down, upper, upper-left and upper-right PUs.

Because of RD-based intra/inter prediction, the search of the current CU branch may have data dependencies on its neighboring left, left-down, upper, upper-left and upper-right CU branches.

Page 9

Proposed Method A(1/2)

Problem Formulation

Page 10

Proposed Method A(2/2)

The HM-7.0 encoder tries to compute the best RD cost starting from the CTU.

• M: the maximum depth of the CTU.
• H0 and H1: the overhead of not splitting the CU and of splitting the CU, respectively.
• H(·): the best RD cost computed for a CU without any restriction.
• G(·): the best RD cost computed for a CU that is not split into sub-CUs.
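As a rough illustration of how these quantities fit together, the sketch below assumes the usual quad-tree recursion: a CU either stays unsplit at cost G plus H0, or is split at cost H1 plus the sum of the best costs of its four sub-CUs, and splitting stops at depth M. The exact formula appears only on the slide figure and is not reproduced in this transcript, so the recursion itself, the names best_rd_cost and split_cu, and the toy cost function passed as g are assumptions for illustration, not HM code.

```python
def split_cu(cu):
    """Split a CU given as (x, y, size) into its four quad-tree sub-CUs."""
    x, y, size = cu
    half = size // 2
    return [(x, y, half), (x + half, y, half),
            (x, y + half, half), (x + half, y + half, half)]


def best_rd_cost(cu, depth, max_depth, g, h0, h1):
    """H(cu): best RD cost of cu with no restriction on splitting.

    g(cu) plays the role of G: the best cost of cu when it is not split.
    h0 and h1 are the overheads of not splitting and of splitting.
    Assumed recursion, not taken from the paper's equations.
    """
    no_split = g(cu) + h0
    if depth == max_depth:
        # At the maximum depth M the CU cannot be split any further.
        return no_split
    split = h1 + sum(
        best_rd_cost(sub, depth + 1, max_depth, g, h0, h1)
        for sub in split_cu(cu)
    )
    return min(no_split, split)
```

The serial encoder has to walk this whole tree for every CTU, which is why the rest of the deck looks for CTUs and CUs whose branches can be evaluated at the same time.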

Page 11

Proposed Method B(1/3)

CTU-Level Parallelism

The best RD costs of the current CTU’s neighboring left, upper, upper-left, and upper-right CTUs are computed first, because the current CTU has data dependencies on these four CTUs.

We use the same DAG-based order as described in our previous work [14] to parallelize CTUs.

• [14] C. Yan et al., “Highly parallel framework for HEVC motion estimation on many-core platform,” in Data Compression Conf., Snowbird, UT, 2013, pp. 63–72.

Page 12

Proposed Method B(2/3)

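Generate a directed acyclic graph (DAG) to capture the dependency relationships of CTUs.

The DAG consists of a set of vertices V and a set of edges E.

Each data dependency between two CTUs corresponds to an edge.

When a CTU has been processed, it is removed from the DAG.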

Page 13

Proposed Method B(3/3)

Page 14

Proposed Method B

Step 1: Initialize DQ and CM. DQ is a waiting queue. CM records, for each CTU, the number of related CTUs that it still depends on.

Step 2: When values in CM become zero, get the corresponding coordinates and push them into DQ.

Step 3: Get coordinates from DQ and process the corresponding CTUs in parallel on the many-core platform.

Step 4: Update CM. When the CTU with coordinate (i, j) has been processed, the values at coordinates (i+1, j), (i+1, j-1), (i, j+1) and (i+1, j+1) in CM are decremented by one.

Step 5: Repeat steps 2 to 4 until every CTU in the frame has been processed.
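The five steps can be sketched as a small scheduling simulation. This is a minimal sketch, not the authors' implementation: it assumes that (i, j) means (row, column), which is the reading under which the four coordinates in Step 4 are exactly the lower, lower-left, right and lower-right neighbors. Instead of dispatching work to cores, it returns "waves" of CTUs that are ready at the same time. The function name schedule_ctus and the wave representation are hypothetical.

```python
from collections import deque

# (row, col) offsets of the CTUs that the current CTU depends on:
# left, upper, upper-left, upper-right.
DEPENDS_ON = [(0, -1), (-1, 0), (-1, -1), (-1, 1)]
# Offsets of the CTUs whose counters drop when CTU (i, j) finishes (Step 4):
# (i+1, j), (i+1, j-1), (i, j+1), (i+1, j+1).
DEPENDENTS = [(1, 0), (1, -1), (0, 1), (1, 1)]


def schedule_ctus(rows, cols):
    """Return a list of waves; CTUs in one wave could run in parallel."""
    def inside(i, j):
        return 0 <= i < rows and 0 <= j < cols

    # Step 1: CM holds how many unprocessed CTUs each CTU is waiting for.
    cm = [[sum(inside(i + di, j + dj) for di, dj in DEPENDS_ON)
           for j in range(cols)] for i in range(rows)]
    # Step 2 (initial): CTUs with a zero counter go into the waiting queue DQ.
    dq = deque((i, j) for i in range(rows) for j in range(cols)
               if cm[i][j] == 0)

    waves = []
    while dq:  # Step 5: repeat until the frame is finished.
        # Step 3: everything currently in DQ is ready to be processed.
        wave = list(dq)
        dq.clear()
        waves.append(wave)
        for i, j in wave:
            # Step 4: decrement the counters of the dependent CTUs.
            for di, dj in DEPENDENTS:
                ni, nj = i + di, j + dj
                if inside(ni, nj):
                    cm[ni][nj] -= 1
                    # Step 2: a counter that reaches zero is pushed into DQ.
                    if cm[ni][nj] == 0:
                        dq.append((ni, nj))
    return waves
```

In this simulation the CTU at row i and column j lands in wave j + 2i, so a frame of R rows and C columns needs C + 2(R - 1) waves. That is the familiar wavefront pattern, and it shows why CTU-level parallelism alone leaves cores idle at the start and end of each frame.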

Page 15

Proposed Method C(1/3)

CU-Level Parallelism

When computing the best RD cost of the current CTU, the RD-based inter/intra modes of the left, upper, upper-left and upper-right CTUs should already have been completely decided.

We analyze the CU-level dependencies within the same frame:

There exist completely independent CUs (CICUs), which have no data dependencies on other CUs within the same CTU.

There exist partially independent CUs (PICUs), which have no remaining data dependencies on other CUs once the related CUs within the same CTU have been processed.

Page 16

Proposed Method C(2/3)

CICUs:

The CICU’s left boundary overlaps the CTU’s left boundary.

The CICU’s upper boundary overlaps the CTU’s upper boundary.

Page 17

Proposed Method C(3/3)

PICUs:

PICUs don’t meet the requirements of CICUs.

The PICU’s left boundary overlaps the CTU’s left boundary, or the neighboring left largest-size CU has been computed.

The PICU’s upper boundary overlaps the CTU’s upper boundary, or the neighboring upper and upper-right largest-size CUs have been computed.
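The CICU and PICU conditions can be written as two small predicates. This is a hedged sketch under stated assumptions: CU positions are given relative to the CTU's top-left corner, the two boundary conditions in each definition are combined with a logical AND, and the status of the neighboring largest-size CUs is passed in as plain booleans, since the transcript does not say how that status is tracked. The names is_cicu and is_picu_ready are hypothetical.

```python
def is_cicu(x, y):
    """A CU at offset (x, y) inside its CTU is a CICU when both its left and
    upper boundaries coincide with the CTU's boundaries (assumed AND)."""
    return x == 0 and y == 0


def is_picu_ready(x, y, left_done, upper_done, upper_right_done):
    """Return True when a non-CICU CU may be processed as a PICU.

    left_done, upper_done and upper_right_done state whether the neighboring
    left, upper and upper-right largest-size CUs have been computed.
    """
    if is_cicu(x, y):
        return False  # CICUs are handled separately, they are never PICUs.
    left_ok = x == 0 or left_done
    upper_ok = y == 0 or (upper_done and upper_right_done)
    return left_ok and upper_ok
```

For example, a CU on the CTU's upper boundary but not on its left boundary, such as offset (32, 0), becomes ready as soon as its left neighbor is computed, whereas a CU at (32, 32) must also wait for its upper and upper-right neighbors.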

Page 18

Experimental Results

As the serial baseline for comparison with our proposed method, we use an encoder ported from the HEVC reference software HM 7.0 without any optimization.

The experimental platform of this letter is Tile64, a member of the TILERA many-core family that contains 64 processing cores [17].

• [17] S. Bell et al., “TILE64-Processor: A 64-core SoC with mesh interconnect,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2008, pp. 88–598.

Page 19

Experimental Results

Page 20

Experimental Results

Page 21

Conclusion

We propose an efficient parallel framework for HEVC CUPTD on many-core processors.

Experiments conducted on the Tile64 platform demonstrate that our method takes less encoding time than the default encoding scheme in HM 7.0.