

A Highly Parallel Framework for HEVC Coding Unit Partitioning Tree Decision on Many-core Processors

Chenggang Yan, Yongdong Zhang, Jizheng Xu, Feng Dai, Liang Li, Qionghai Dai, and Feng Wu.

IEEE SIGNAL PROCESSING LETTERS, VOL. 21, NO. 5, MAY 2014


Outline

• Introduction
• Related Work
• Proposed Method
• Experimental Results
• Conclusion


Introduction(1/3)

In HEVC, each frame is divided into non-overlapping CTUs, which can be recursively split into smaller CUs.

For a CTU, the CU partitioning tree (CUPT) controls how a CTU is coded with CUs with variable block sizes and coding modes.

The price to be paid for higher coding efficiency is higher computational complexity.


Introduction(2/3)

To speed up the CUPT decision process, many researchers have tried to reduce the search space by not searching every branch of the quad-tree [10].

To guarantee coding efficiency, many branches of the quad-tree cannot be skipped, so the speedup is no more than a factor of two.

Many researchers consider only RD-based intra mode selection, although inter mode selection is much more time-consuming.

• [10] L. Shen, Z. Liu, X. Zhang et al., “An effective CU size decision method for HEVC encoders,” IEEE Trans. Multimedia, vol. 15, pp. 465–470, Jan. 2013.


Introduction(3/3)

Many-core processors are good candidates for speeding up compression algorithms.

Efficient parallelization of the CUPT decision (CUPTD) on many-core processors is challenging because CUPTD has complicated data dependencies.

If CUPTD cannot be extensively parallelized, cores will be left unused and performance may suffer.


Related Work(1/3)

HEVC CU Partition Tree Decision(CUPTD)


Related Work(2/3)

For RD-based intra prediction:

Instead of applying intra coding at the PU level, HEVC conducts intra prediction sequentially at the TU level, always using the nearest neighboring reference samples from the already reconstructed TUs.

To enhance coding efficiency, HEVC provides as many as 35 intra prediction modes.

As in H.264/AVC, the left, above, and above-right neighboring reconstructed samples are used for intra prediction.


Related Work(3/3)

For RD-based inter prediction:

The best motion vector predictor is selected from a given advanced motion vector prediction candidate list (AMVPCL).

The AMVPCL is composed of both spatial candidates and temporal candidates.

Spatial candidates need the motion information of the neighboring left, left-down, upper, upper-left and upper-right PUs.

Because of RD-based intra/inter prediction, the search of the current CU branch may have data dependencies on its neighboring left, left-down, upper, upper-left and upper-right CU branches.


Proposed Method A(1/2)

Problem Formulation


Proposed Method A(2/2)

HM-7.0 encoder tries to compute the best RD cost starting from .

• M: maximum depth of the CTU.
• H0 and H1: overhead of not splitting the CU and of splitting the CU, respectively.
• H(): the best RD cost computed for the CU without any restriction.
• G(): the best RD cost computed for the CU when it is not split into sub-CUs.
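The definitions above suggest a recursive quad-tree search. The following Python sketch illustrates one plausible reading and is not the formula from the letter, whose equation did not survive in this transcript. It assumes that H for a CU is the smaller of "G plus H0" and "the sum of H over the four sub-CUs plus H1", and that a CU at depth M cannot be split. The function name `best_rd_cost`, the `(x, y, size)` CU tuple, and the caller-supplied cost function `g` are hypothetical.

```python
def best_rd_cost(cu, depth, max_depth, g, h0, h1):
    """Return an assumed H(cu): the best RD cost with or without splitting.

    cu is a hypothetical (x, y, size) tuple; g(cu) plays the role of G(),
    the cost of coding the CU without splitting it.
    """
    x, y, size = cu
    unsplit_cost = g(cu) + h0  # cost of keeping the CU whole
    if depth == max_depth:     # assumption: no further split at depth M
        return unsplit_cost
    half = size // 2
    sub_cus = [(x, y, half), (x + half, y, half),
               (x, y + half, half), (x + half, y + half, half)]
    split_cost = h1 + sum(
        best_rd_cost(sub, depth + 1, max_depth, g, h0, h1) for sub in sub_cus
    )
    return min(unsplit_cost, split_cost)
```

With a toy cost such as `g = lambda cu: cu[2] ** 3`, calling `best_rd_cost((0, 0, 8), 0, 2, g, 1, 2)` explores all three depths serially, which is the behaviour the parallel framework aims to speed up.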


Proposed Method B(1/3)

CTU-Level Parallelism

The best RD costs of the current CTU's left, upper, upper-left, and upper-right neighboring CTUs are computed before the current CTU is processed.

The current CTU has data dependencies on its neighboring left, upper, upper-left, and upper-right CTUs.

We use the same DAG-based order as described in our previous work [14] to parallelize CTUs.

• [14] C. Yan et al., “Highly parallel framework for HEVC motion estimation on many-core platform,” in Data Compression Conf., Snowbird, UT, 2013, pp. 63–72.


Proposed Method B(2/3)

Generate a DAG to capture the dependency relationships between CTUs.

The DAG consists of a set of vertices V and a set of edges E. Each data dependency corresponds to an edge. Once a CTU has been processed, it is removed from the DAG.


Proposed Method B(3/3)


Proposed Method B

Step 1: Initialize DQ and CM. DQ is a waiting queue. CM records, for each CTU, the number of related CTUs (the CTUs it depends on) that have not yet been processed.

Step 2: When values in CM become zero, get the corresponding coordinates and push them into DQ.

Step 3: Get coordinates from DQ and process the corresponding CTUs in parallel on the many-core platform.

Step 4: Update CM. When the CTU with coordinate (i, j) has been processed, the values at coordinates (i+1, j), (i+1, j-1), (i, j+1) and (i+1, j+1) in CM are each decremented by one.

Step 5: Repeat steps 2 to 4 until the frame is finished.
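The five steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that i is the CTU row and j the CTU column, that CM initially holds the number of in-frame left, upper, upper-left and upper-right neighbors, and that all CTUs currently in DQ are processed together as one "wave" instead of being dispatched to real cores. The name `schedule_ctus` is hypothetical.

```python
from collections import deque

# (row, col) offsets of the CTUs a CTU depends on:
# left, upper, upper-left, upper-right.
DEPENDS_ON = [(0, -1), (-1, 0), (-1, -1), (-1, 1)]
# Offsets of the CTUs whose CM entry is decremented in Step 4.
DEPENDENTS = [(1, 0), (1, -1), (0, 1), (1, 1)]


def schedule_ctus(rows, cols):
    """Return a list of waves; CTUs within one wave have no pending dependencies."""
    def in_frame(i, j):
        return 0 <= i < rows and 0 <= j < cols

    # Step 1: initialize CM with the dependency counts and DQ as an empty queue.
    cm = [[sum(in_frame(i + di, j + dj) for di, dj in DEPENDS_ON)
           for j in range(cols)] for i in range(rows)]
    # Step 2 (initial): CTUs whose count is already zero are ready.
    dq = deque((i, j) for i in range(rows) for j in range(cols) if cm[i][j] == 0)

    waves = []
    while dq:  # Step 5: repeat until every CTU has been processed
        # Step 3: take everything that is ready; these could run in parallel.
        wave = list(dq)
        dq.clear()
        waves.append(wave)
        # Step 4: update CM for the dependents of each processed CTU.
        for i, j in wave:
            for di, dj in DEPENDENTS:
                ni, nj = i + di, j + dj
                if in_frame(ni, nj):
                    cm[ni][nj] -= 1
                    if cm[ni][nj] == 0:  # Step 2: newly ready CTU
                        dq.append((ni, nj))
    return waves
```

Under these assumptions, the CTU at row i and column j lands in wave j + 2i, so a frame of 2 rows and 3 columns needs 5 waves, and at most two CTUs are ready at once. This shows why CTU-level parallelism alone may leave cores idle on small frames.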


Proposed Method C(1/3)

CU-Level Parallelism

When the current CTU is processed, the RD-based inter/intra modes of its left, upper, upper-left and upper-right CTUs must already have been completely decided.

We analyze the dependencies at the CU level within the same frame:

There exist completely independent CUs (CICUs), which have no data dependencies on other CUs within the same CTU.

There exist partially independent CUs (PICUs), which have no remaining data dependencies on other CUs once the related CUs within the same CTU have been processed.


Proposed Method C(2/3)

CICUs:

• The CICU’s left boundary overlaps the CTU’s left boundary.
• The CICU’s upper boundary overlaps the CTU’s upper boundary.


Proposed Method C(3/3)

PICUs:

• PICUs do not meet the requirements of CICUs.
• The PICU’s left boundary overlaps the CTU’s left boundary, or the neighboring left largest-size CU has been computed.
• The PICU’s upper boundary overlaps the CTU’s upper boundary, or the neighboring upper and upper-right largest-size CUs have been computed.
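The CICU and PICU conditions can be expressed as two small predicates. The sketch below is one reading of the slides, not the authors' code. It assumes that both CICU conditions must hold together, that (x, y) is the CU's offset from the top-left corner of its CTU, and that the caller already knows whether the neighboring largest-size CUs have been computed. The function names and boolean parameters are hypothetical.

```python
def is_cicu(x, y):
    """Assumed CICU test: the CU touches both the left and upper CTU boundaries."""
    return x == 0 and y == 0


def is_picu_ready(x, y, left_done, upper_done, upper_right_done):
    """Assumed PICU test: not a CICU, and every remaining dependency is resolved.

    left_done, upper_done and upper_right_done say whether the neighboring
    left, upper and upper-right largest-size CUs have been computed.
    """
    if is_cicu(x, y):
        return False
    left_ok = x == 0 or left_done
    upper_ok = y == 0 or (upper_done and upper_right_done)
    return left_ok and upper_ok
```

For example, a CU at offset (32, 0) lies on the upper CTU boundary, so under this reading it becomes processable as soon as its left neighbor is done.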


Experimental Results

To compare our proposed method with serial execution, we adopt an encoder migrated from HEVC reference software HM7.0 without any optimization.

The experimental platform of this letter is Tile64, a member of the Tilera many-core processor family that contains 64 processing cores [17].

• [17] S. Bell et al., “TILE64-Processor: A 64-core SoC with mesh,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2008, pp. 88–598.


Conclusion

We propose an efficient parallel framework for HEVC CUPTD on many-core processors.

Experiments conducted on the Tile64 platform demonstrate that our method reduces encoding time compared with the default encoding scheme in HM 7.0.
