Saccess.ee.ntu.edu.tw/course/SOC實驗教材/final project … · Web view這一次project所用來作為起點的JPEG軟體encoder JENCSRC是個很小很簡潔的JPEG encoder,

SoC Laboratory

Lab 10

JPEG Encoder

Team #7

P91922003 李彥勳, P90922016 謝嵩淮, R91921030 侯凱文

Project Description

The major goal of this project is to implement a fast JPEG encoder on the ARM Integrator Development Board while using as many of the board's resources as possible; that is, to partition the application into hardware and software components and then integrate them.

Topics

The JPEG encoding flow block diagram (Figure 1) gives a basic idea of the building blocks of this problem.

Figure 1: JPEG encoder block diagram

1. RGB → YUV: The input source is a bitmap file that represents colors in the well-known RGB encoding. The JPEG specification, however, requires colors to be represented in Y-Cb-Cr (YUV) encoding, so we need to convert the RGB colors into YUV colors.

2. DCT (Discrete Cosine Transform): The main purpose of the DCT is to convert the sampled signal into its spatial-frequency components. The converted values can then be reduced or enhanced according to the characteristics of the human visual system for better compression.

3. Quantization: A quantizer greatly compresses the information in the high-frequency components, to which the human eye is quite insensitive.

4. Zig-zag scan: Reorder the coefficient sequence so that similar values are grouped together for further compression.

5. Entropy coding: We use a static Huffman encoding table, since dynamic Huffman encoding requires more computation power. The more frequently a value appears, the fewer bits are used to encode it.


6. Profiling and hardware/software partitioning: We profile a pure-software JPEG encoder on the ARM platform to see which parts should be implemented in hardware and which should remain in software.

7. Memory management: The on-board SRAM is no more than 4M bytes in size, so larger bitmaps cannot be encoded from SRAM alone. However, 128M bytes of SDRAM are installed on the core module, and there should be a way to make use of it.

8. Custom IP design: After the hardware/software partitioning is completed, the hardware components are implemented on the FPGA and integrated with the AMBA bus.

9. Software optimization: The software components depend heavily on floating-point arithmetic, so a fixed-point implementation is preferred.
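The zig-zag reordering in topic 4 can be illustrated with a short sketch. It is not taken from JENCSRC; the function names build_zigzag_order and zigzag_scan are hypothetical. It walks the anti-diagonals of an 8x8 block, alternating direction, which yields the conventional JPEG zig-zag order.

```c
#include <assert.h>

#define BLOCK_DIM 8

/* Fill order[k] with the row-major index of the k-th coefficient
   visited by the zig-zag scan (hypothetical helper for illustration). */
void build_zigzag_order(int order[BLOCK_DIM * BLOCK_DIM])
{
    int k = 0;
    /* s = row + col identifies one anti-diagonal of the block. */
    for (int s = 0; s <= 2 * (BLOCK_DIM - 1); s++) {
        int r_min = (s < BLOCK_DIM) ? 0 : s - (BLOCK_DIM - 1);
        int r_max = (s < BLOCK_DIM) ? s : BLOCK_DIM - 1;
        if (s % 2 == 0) {
            /* Even diagonals run from bottom-left to top-right. */
            for (int r = r_max; r >= r_min; r--)
                order[k++] = r * BLOCK_DIM + (s - r);
        } else {
            /* Odd diagonals run from top-right to bottom-left. */
            for (int r = r_min; r <= r_max; r++)
                order[k++] = r * BLOCK_DIM + (s - r);
        }
    }
    assert(k == BLOCK_DIM * BLOCK_DIM);
}

/* Reorder one block of quantized coefficients into zig-zag order. */
void zigzag_scan(const int block[BLOCK_DIM * BLOCK_DIM],
                 const int order[BLOCK_DIM * BLOCK_DIM],
                 int out[BLOCK_DIM * BLOCK_DIM])
{
    for (int k = 0; k < BLOCK_DIM * BLOCK_DIM; k++)
        out[k] = block[order[k]];
}
```

Because the order never changes, an encoder would normally build the table once (or store it as a constant) and reuse it for every block.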

Introduction to development environment

The ARM Integrator Board is divided into three parts: main board, core module, and logic module. The ARM920 CPU is installed on the core module, together with 128M bytes of SDRAM and no more than 4M bytes of SSRAM. The FPGA on the core module serves important functions and thus should not be reprogrammed.

Figure 2: Core module block diagram

The block diagram of the logic module is shown in Figure 3.

Figure 3: Logic module block diagram

The FPGA on the logic module is the hardware resource that we can use for our custom IP. The software and hardware cooperate in the JPEG application and communicate with each other through the AMBA bus interface. In both figures, the grayed part indicates the hardware resources we try to use.

Performance goal

The pure software version runs very slowly, especially for large bitmaps (1024x768); encoding such a bitmap can take several minutes. Throughout the development of this project we adopt a “successive refinement” approach: whether or not the current modification succeeds, we always keep a runnable version that works in all cases.

The basic goal is of course “a workable version”, by which we mean that hardware and software cooperate correctly, no matter how poor the performance is. We later found that we could go much further than that: we not only reached the baseline but also greatly improved performance. Our ideal is “real-time” encoding of moving pictures, that is, at least 30 frames per second, or about 0.033 seconds per frame. This requires a well-pipelined architecture and is yet to be achieved.

Evaluation

Introduction to software source

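The JPEG software encoder used as the starting point for this project, JENCSRC, is a very small and concise JPEG encoder. It makes no customized improvements such as custom Huffman codes and does not tune a quality-factor parameter; it is the plainest possible encoder and follows the standard throughout. The program has some interesting features. For performance reasons, many computations are implemented by table lookup, including the RGB-to-YUV conversion; the Category (the bit size of a coefficient) is likewise looked up in a table instead of being computed.

The notes that accompany the code mention that, apart from the memory required for the RGB bitmap, another 256 KB is needed to hold these tables. Compared with the memory required for the bitmap this is not much, yet it saves a good deal of computation. A direct RGB-to-YUV conversion needs 9 floating-point multiplications and 6 floating-point additions per pixel; with table lookup, each pixel needs only nine table lookups (three per output component), 6 integer additions, and 3 shifts:

Direct computation:

Y = R * .299 + G * .587 + B * .114;

U = R * -.169 + G * -.332 + B * .500;

V = R * .500 + G * -.419 + B * -.0813;

Table lookup:

Y = (YRtab[R] + YGtab[G] + YBtab[B]) >> 16;

U = (URtab[R] + UGtab[G] + UBtab[B]) >> 16;

V = (VRtab[R] + VGtab[G] + VBtab[B]) >> 16;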
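As a rough illustration of how such tables could be built and used, here is a minimal sketch. It is not the JENCSRC source: the helpers to_fix16, init_yuv_tables, shift16_signed, and rgb_to_yuv are hypothetical, and only the table names and the 16-bit fractional scale follow the expressions above. The sketch also assumes a small bias around the shift for U and V, because right-shifting a negative int is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* One table per (output component, input channel) pair. */
static int32_t YRtab[256], YGtab[256], YBtab[256];
static int32_t URtab[256], UGtab[256], UBtab[256];
static int32_t VRtab[256], VGtab[256], VBtab[256];

/* coeff * sample as a rounded fixed-point value with 16 fractional bits. */
static int32_t to_fix16(double coeff, int sample)
{
    double v = coeff * (double)sample * 65536.0;
    return (int32_t)(v >= 0.0 ? v + 0.5 : v - 0.5);
}

/* Precompute all products once, before any pixel is converted. */
void init_yuv_tables(void)
{
    for (int i = 0; i < 256; i++) {
        YRtab[i] = to_fix16(0.299, i);
        YGtab[i] = to_fix16(0.587, i);
        YBtab[i] = to_fix16(0.114, i);
        URtab[i] = to_fix16(-0.169, i);
        UGtab[i] = to_fix16(-0.332, i);
        UBtab[i] = to_fix16(0.500, i);
        VRtab[i] = to_fix16(0.500, i);
        VGtab[i] = to_fix16(-0.419, i);
        VBtab[i] = to_fix16(-0.0813, i);
    }
}

/* U and V sums can be negative but stay above -128.0, so adding 128
   before the shift keeps the shifted operand non-negative. */
static int shift16_signed(int32_t sum)
{
    return (int)((sum + (128 << 16)) >> 16) - 128;
}

/* Per pixel: nine table lookups, six additions, three shifts. */
void rgb_to_yuv(uint8_t r, uint8_t g, uint8_t b, int *y, int *u, int *v)
{
    assert(y != 0 && u != 0 && v != 0);
    *y = (YRtab[r] + YGtab[g] + YBtab[b]) >> 16;
    *u = shift16_signed(URtab[r] + UGtab[g] + UBtab[b]);
    *v = shift16_signed(VRtab[r] + VGtab[g] + VBtab[b]);
}
```

Nine tables of 256 32-bit entries occupy only 9 KB in this sketch; the 256 KB figure quoted above is for the tables in JENCSRC itself, which this sketch does not reproduce.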

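For the DCT, the program uses the AAN algorithm, so the quantization matrix has to be adjusted accordingly. The quantization step that the program actually divides by is therefore not the original value, which needs particular attention.

Because we plan to implement the DCT in FPGA hardware, the quantization matrix we use is the original matrix without scaling.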

Original quantization matrix:

16, 11, 10, 16, 24, 40, 51, 61,
12, 12, 14, 19, 26, 58, 60, 55,
14, 13, 16, 24, 40, 57, 69, 56,
14, 17, 22, 29, 51, 87, 80, 62,
18, 22, 37, 56, 68, 109, 103, 77,
24, 35, 55, 64, 81, 104, 113, 92,
49, 64, 78, 87, 103, 121, 120, 101,
72, 92, 95, 98, 112, 100, 103, 99

Scaling factors (the first row and the first column list the per-column and per-row factors; each entry is their product):

         1.000  1.387  1.306  1.176  1.000  0.786  0.541  0.276
1.000    1.000  1.387  1.306  1.176  1.000  0.786  0.541  0.276
1.387    1.387  1.924  1.811  1.631  1.387  1.090  0.750  0.383
1.306    1.306  1.811  1.705  1.536  1.306  1.027  0.707  0.360
1.176    1.176  1.631  1.536  1.383  1.176  0.924  0.636  0.325
1.000    1.000  1.387  1.306  1.176  1.000  0.786  0.541  0.276
0.786    0.786  1.090  1.027  0.924  0.786  0.618  0.425  0.217
0.541    0.541  0.750  0.707  0.636  0.541  0.425  0.293  0.149
0.276    0.276  0.383  0.360  0.325  0.276  0.217  0.149  0.076

The quantization matrix to be used with the AAN DCT is the original quantization matrix multiplied, entry by entry, by the scaling factor at the corresponding position.
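The entry-by-entry scaling can be sketched as follows. The function name build_aan_quant_table and the array names are hypothetical, and the sketch does not reproduce how JENCSRC stores its table; it only multiplies each original step by the row and column factors, given here with more digits than the three-decimal values listed above.

```c
#include <assert.h>
#include <stddef.h>

/* AAN scale factors; rounded to three decimals they match the listed values. */
static const double aan_scale[8] = {
    1.0, 1.387039845, 1.306562965, 1.175875602,
    1.0, 0.785694958, 0.541196100, 0.275899379
};

/* The original (unscaled) quantization matrix, in row-major order. */
static const int std_luminance_qt[64] = {
    16, 11, 10, 16,  24,  40,  51,  61,
    12, 12, 14, 19,  26,  58,  60,  55,
    14, 13, 16, 24,  40,  57,  69,  56,
    14, 17, 22, 29,  51,  87,  80,  62,
    18, 22, 37, 56,  68, 109, 103,  77,
    24, 35, 55, 64,  81, 104, 113,  92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103,  99
};

/* scaled[row][col] = qt[row][col] * aan_scale[row] * aan_scale[col] */
void build_aan_quant_table(const int qt[64], double scaled[64])
{
    assert(qt != NULL && scaled != NULL);
    for (int row = 0; row < 8; row++)
        for (int col = 0; col < 8; col++)
            scaled[row * 8 + col] =
                qt[row * 8 + col] * aan_scale[row] * aan_scale[col];
}
```

With a hardware DCT that already produces correctly scaled coefficients, this step is skipped and the original matrix is used directly, as stated above.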

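Overall, the program is very small, fast because of its conciseness, and skillfully written. The later before.cpp, in contrast, is written in a plainer style but is much slower, so our team uses JENCSRC as the starting point for the software.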

Profiling

Having chosen the starting software, we need profiling data to decide the direction of the improvements that follow. Following the method from the earlier lab, we obtain a profile of the pure software version (excerpt below):

Name                                cum%     self%    desc%    calls
------------------------------------------------------------------------------------------------
main_encoder                        93.25%   0.10%    93.15%   1
  load_data_units_from_RGB_buffer            4.63%    0.00%    475
  process_DU                                 5.32%    83.18%   1425
------------------------------------------------------------------------------------------------
process_DU                          88.51%   5.32%    83.18%   1425
  fdct_and_quantization                      15.92%   62.41%   1425
  writebits                                  4.64%    0.19%    24759
------------------------------------------------------------------------------------------------
fdct_and_quantization               78.34%   15.92%   62.41%   1425
  _fmul                                      11.67%   0.00%    205200
  _fflt                                      3.52%    0.00%    90432
  _fsub                                      20.42%   0.00%    269568
  _fadd                                      18.32%   0.00%    387600
  _f2d                                       1.43%    0.00%    86400
  _dfix                                      2.81%    0.00%    87168
  _dadd                                      4.22%    0.00%    90432
------------------------------------------------------------------------------------------------

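process_DU alone takes 88% of the execution time. It covers the DCT, quantization, and run-length + Huffman coding, yet run-length + Huffman coding accounts for only about 5% in total. The rest is spent in floating-point emulation routines such as _fmul, _fadd, and _fsub. Because the DCT and quantization in this software both use the float data type, the code is a poor fit for a processor like the ARM, which has no built-in floating-point unit. The DCT and quantization therefore become the focus of the performance improvement.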

HW/SW partitioning

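HW/SW partitioning is key to the success of the whole system. Appropriate partitioning strikes a delicate balance between cost and performance, so we should not pursue speed blindly. The profiling results above show that floating-point computation is the troublesome part. A filter such as the DCT, which works on a small amount of data but needs many operations per data point, is best suited to hardware acceleration. Quantization, in contrast, needs one operation per data point, so the communication overhead could consume the time gained from hardware acceleration. In addition, the ARM is a RISC CPU that can produce an integer multiplication result in very few cycles. We therefore keep quantization on the ARM CPU but switch it to fixed-point arithmetic, which should give a good speedup.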

Implementation details

Memory Management

Implementation

The first problem we met is that there is not enough memory. Although the ARM development suite provides basic dynamic memory management functions, i.e. malloc() and free(), they only manage the SRAM, which is a scarce resource. Whenever we request a large memory block, say 4M bytes, malloc() returns NULL. We therefore need to use the 128M-byte SDRAM, under the following considerations:

1. SRAM is fast, so we use it whenever we can.

2. The software should be modified as little as possible.

3. The new functions should be used in the same way as the old ones.

4. Since no other processes compete for memory, the memory management algorithm should be as simple as possible.

Figure 4: Core module memory map

To meet these four considerations, we define two new APIs: heapalloc() and heapfree(). The following describes the basic ideas of these two APIs. We do not apply any complex memory management algorithm here; we use a dummy allocation algorithm instead (i.e. we assume the large memory blocks allocated are never released).

‧ void *heapalloc( int size ):

global var heap_ptr = 0x100000 /* Refer to Figure 4. */

local var memblk /* memory block to return */

memblk = malloc(size)

if memblk is not NULL
then
    return memblk
else
    memblk = heap_ptr
    adjust heap_ptr to ( heap_ptr + size ), aligned to 4 bytes
    return memblk

‧ void heapfree( void *ptr ):

if ptr is not in the range [0x100000, 0x100000 + 128M bytes)
then
    free(ptr) /* this memory block was allocated by malloc() */
else
    do nothing /* we use dummy allocation, no 'release' here */
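The pseudocode above can be sketched in C as follows. This is a host-side illustration, not the code that ran on the board. A static array stands in for the SDRAM region at 0x100000, and the constants SDRAM_POOL_BYTES and SRAM_REQUEST_MAX are hypothetical: the latter imitates malloc() failing for large requests, which a desktop malloc() would not do. The size parameter uses size_t, like malloc(), rather than int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SDRAM_POOL_BYTES (8u * 1024u * 1024u) /* hypothetical pool size */
#define SRAM_REQUEST_MAX (64u * 1024u)        /* hypothetical malloc limit */

/* Stand-in for the SDRAM region; uint32_t gives 4-byte alignment. */
static uint32_t sdram_pool[SDRAM_POOL_BYTES / 4];
static size_t sdram_used = 0; /* plays the role of heap_ptr */

static int in_sdram_pool(const void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;
    uintptr_t base = (uintptr_t)sdram_pool;
    return p >= base && p < base + SDRAM_POOL_BYTES;
}

void *heapalloc(size_t size)
{
    /* Try the fast memory first; large requests are treated as failed. */
    void *memblk = (size <= SRAM_REQUEST_MAX) ? malloc(size) : NULL;
    if (memblk != NULL)
        return memblk;

    /* Fall back to the bump allocator, keeping 4-byte alignment. */
    size_t aligned = (size + 3u) & ~(size_t)3u;
    if (aligned < size || aligned > SDRAM_POOL_BYTES - sdram_used)
        return NULL;
    assert(sdram_used % 4 == 0);
    memblk = (uint8_t *)sdram_pool + sdram_used;
    sdram_used += aligned;
    return memblk;
}

void heapfree(void *ptr)
{
    if (in_sdram_pool(ptr))
        return;  /* dummy allocation: pool blocks are never released */
    free(ptr);   /* block came from malloc() */
}
```

Callers use the pair exactly like malloc() and free(), which is what keeps the changes to the encoder source small.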

Problems met

The biggest problem we met in this topic is that SDRAM operation is quite slow. After careful analysis of the run-time behavior, we reached two conclusions. First, the semihosting mechanism used on the ARM platform redirects file reads and writes through the JTAG interface, which makes file I/O very slow. Second, the large memory blocks we use in SDRAM are accessed in a ‘sequential’ way, so there are many cache misses during the operations.

Future works on this part

Memory management is a big topic, especially when multiple processes contend for the shared resource. However, this should be done by the RTOS. When our application gets more complex, we should leave this job to the RTOS, and heapalloc() and heapfree() should then invoke the memory service provided by the underlying RTOS.

Custom IP: RGB2YUV / DCT

Hardware implementation

We have successfully implemented RGB2YUV and the DCT in hardware, since these two parts take the most computation time. They are attached to the AMBA bus and addressed as shown in the table below.

            write_head    read_head     ptr
DCT         0xcc000000    0xcc000020    0xcc000040
RGB2YUV     0xcc000060    0xcc000070    0xcc000040

Table 1: Memory map of the DCT and RGB2YUV

1. Architecture of our RGB2YUV:

We use the Verilog file obtained from the OpenCores website [1]. The R/G/B inputs can take any value in the range 0 to 2^RGB_WIDTH-1 and are unsigned integers.

The Y output value is always unsigned fractional, whereas the U and V components are always signed. The hardware architecture is shown below:

Figure 5: RGB2YUV block diagram

2. DCT:

The DCT provided by NCTU [2] is a 1-dimensional Discrete Cosine Transform. The computational equation of the forward DCT is:
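The equation itself did not survive in this copy of the report. As a point of reference only, and assuming the IP computes the standard 8-point forward DCT-II used in JPEG (the report does not state its exact scaling), the 1-D transform can be written as:

```latex
F(u) = \frac{C(u)}{2} \sum_{x=0}^{7} f(x)\,
       \cos\!\left(\frac{(2x+1)\,u\,\pi}{16}\right),
\qquad u = 0,\dots,7,
\qquad
C(u) =
\begin{cases}
  1/\sqrt{2}, & u = 0 \\
  1,          & u > 0
\end{cases}
```

Here f(x) are the eight input samples and F(u) the eight output coefficients; both symbols are notation chosen for this note.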

Software modified

Since RGB2YUV and the DCT are now implemented in hardware, we can replace the two computational functions in the ENC.C file. The functions are rewritten as follows:

1. RGB2YUV part:

2. DCT part:

Figure 6: Modified source code

Problems met

1. Both JPEG programs provided by the TA, ENC.C and before.cpp, have problems:

(1) ENC.C: when encoding a larger bitmap file, it runs out of memory. The solution has been discussed in the previous section.

(2) before.cpp: if we use this file without any modification, the compressed image quality is bad. One solution we found is to replace the DCT function in before.cpp. Below, the results are compared among the original file, the all-software implementation, and the implementation with the hardware DCT.

              Original        Software        Hardware
File          lena256.bmp     lena256.jpg     lena256.jpg
Dimensions    256 x 256       256 x 256       256 x 256
Size          193 KB          13.3 KB         8.24 KB

Table 2: Compressed results

2. The RGB2YUV coefficients are not consistent between the software and the hardware, so considerable effort is needed to adjust the parameters in the hardware file.

Future work on this part

1. DCT pipeline:

A pipelined design improves performance. A one-dimensional DCT is easier to implement than a two-dimensional DCT. Below is one design example of a 1D-DCT pipeline [3]:

Figure 7: Pipelined DCT

2. Power estimation:

Low power is one of the most important issues today. We should consider the trade-off between performance and power consumption, so power estimation should be well developed.

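Fixed point quantization

Details of implementation:

Fixed-point arithmetic replaces floating-point operations with integer addition, subtraction, multiplication, and division. Two points matter in the implementation.

The first is precision. In our project, sacrificing precision may only degrade picture quality, but in other settings it could make results diverge and cause a system failure, so simulation and evaluation beforehand are necessary.

The second is that we must handle the binary point ourselves: operands must be aligned before addition and subtraction, and results must be shifted back into position after multiplication and division. Negative numbers need special care when shifting. For example, in a system with three fractional bits, 0xff represents -0.125, but however far a negative number is shifted right, it stays 0xff, so the dividend never gets any smaller. This causes errors in quantization: negative numbers that should be quantized to 0 all become -1. Fixed-point division of negative numbers must therefore handle the sign: convert the negative number to positive, divide, and then convert back.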
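The sign handling described above can be sketched as follows. The names FRAC_BITS, fx_to_int, and fx_quantize are hypothetical and the three fractional bits simply follow the example in the text; the fixed-point format actually used in the project is not stated in the report.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 3 /* three fractional bits, as in the example above */

/* Drop the fractional bits, rounding toward zero. The shift is applied
   to the magnitude only, so -0.125 (raw value -1) becomes 0, not -1.
   Assumes x is not INT32_MIN, whose negation would overflow. */
int32_t fx_to_int(int32_t x)
{
    return (x < 0) ? -((-x) >> FRAC_BITS) : (x >> FRAC_BITS);
}

/* Quantize a fixed-point coefficient by an integer step size:
   make it positive, divide, then restore the sign. */
int32_t fx_quantize(int32_t coeff_fx, int32_t step)
{
    assert(step > 0);
    int32_t magnitude = (coeff_fx < 0) ? -coeff_fx : coeff_fx;
    int32_t q = (magnitude >> FRAC_BITS) / step;
    return (coeff_fx < 0) ? -q : q;
}
```

For instance, a coefficient of -0.125 with a step of 16 now quantizes to 0, while -40.0 with the same step quantizes to -2, mirroring the result for +40.0.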

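Problems met:

While implementing the functions that convert floating point to fixed point, we found a serious problem: the pow() function in math.h does nothing when the exponent is negative. This casts doubt on how well ARM supports the ANSI C library. It affects legacy code and program portability, and the effort of porting a program to the ARM platform also depends on the degree of ANSI C support.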

Performance / Quality result

Test-bench:

1. TAKAYO.bmp: 150 x 200

2. lena256.bmp: 256 x 256

3. Mika.bmp: 1024 x 768

Speed:

(Time unit: second)

                                      TAKAYO.bmp    Lena256.bmp    Mika.bmp
                                      150 x 200     256 x 256      1024 x 768
Original software            Load     6             11             125
                             Encode   18            39             438
Fixed point                  Load     6             11             124
                             Encode   15            33             364
Hardware DCT                 Load     5             11             126
                             Encode   12            25             280
Hardware DCT + fixed point   Load     6             12             126
                             Encode   6             13             173

Original software: the original JENCSRC running on the ARM CPU.
Fixed point: the original source with fixed-point quantization.
Hardware DCT: the original source with the FPGA DCT IP.
Hardware DCT + fixed point: both techniques merged into the encoder.

We noticed that, because JTAG is slow, loading the input file is quite slow. What surprised us is that memory seems to be very slow too. We therefore suspect that slow memory is one of the reasons our encoding time is on the order of seconds. Such slow memory is somewhat unreasonable and may be due to a misconfiguration of the platform. This problem remains open.

Another point is that our DCT module implements a one-dimensional DCT, so two passes are needed for each block and the data is transferred twice. The extra round of communication is the price of saving some FPGA gates.
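The two-pass scheme can be sketched in software as below. Here dct_1d is a plain C stand-in for the hardware IP, assumed to compute the standard 8-point DCT-II, and dct_2d is a hypothetical driver; neither is taken from the project source. The cosine constants are tabulated rather than computed at run time.

```c
#include <assert.h>

#define N 8

/* cos(k * pi / 16) for k = 0..8; other angles follow by symmetry. */
static const double cos_tab[9] = {
    1.0, 0.9807852804032304, 0.9238795325112867, 0.8314696123025452,
    0.7071067811865476, 0.5555702330196022, 0.3826834323650898,
    0.1950903220161283, 0.0
};

static double cos_pi16(int k)
{
    assert(k >= 0);
    k %= 32;                 /* period is 2*pi, i.e. 32 steps of pi/16 */
    if (k > 16)
        k = 32 - k;          /* cos(2*pi - t) = cos(t) */
    return (k > 8) ? -cos_tab[16 - k] : cos_tab[k]; /* cos(pi - t) = -cos(t) */
}

/* Software stand-in for the 1-D DCT IP. */
void dct_1d(const double in[N], double out[N])
{
    for (int u = 0; u < N; u++) {
        double sum = 0.0;
        for (int x = 0; x < N; x++)
            sum += in[x] * cos_pi16((2 * x + 1) * u);
        out[u] = 0.5 * ((u == 0) ? cos_tab[4] : 1.0) * sum;
    }
}

/* 2-D DCT of one block: pass 1 over the rows, pass 2 over the columns. */
void dct_2d(double block[N][N])
{
    double line[N], result[N];

    for (int r = 0; r < N; r++) {          /* first transfer of the data */
        dct_1d(block[r], result);
        for (int c = 0; c < N; c++)
            block[r][c] = result[c];
    }
    for (int c = 0; c < N; c++) {          /* second transfer of the data */
        for (int r = 0; r < N; r++)
            line[r] = block[r][c];
        dct_1d(line, result);
        for (int r = 0; r < N; r++)
            block[r][c] = result[r];
    }
}
```

Each block therefore makes 16 trips through the 1-D transform, eight for rows and eight for columns, which is where the doubled data transfer comes from.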

Quality:

To avoid pasting too many pictures, the resulting JPG files are supplied as separate files.

Original software result:

Accelerated software result:

After changing the original precise floating-point calculation to integer calculation, a more serious blocking effect and loss of detail are observed.

Thoughts

We have gained experience in hardware/software codesign by building the JPEG encoder on the ARM SoC platform. We learned how to design “myip” on the AMBA bus and how to communicate between the core module and the logic module. We also learned how the JPEG algorithm is implemented. It is an interesting project.

References

[1] Project: Color Space Converter RGB=>YUV, http://www.opencores.org/projects/csc_rgb_yuv/
[2] http://twins.ee.nctu.edu.tw/courses/ip_core_02/index.html
[3] A Pipelined JPEG CODEC ASIC, http://www-cad.eecs.berkeley.edu/~newton/Classes/EE290sp99/pages/hw1/asic.htm
[4] AMBA AHB standard 2.0.
[5] Integrator/CM940T, CM920T, CM740T, and CM720T User Guide, ARM.
[6] Integrator/LM-XCV600E+, Integrator/LM-EP20K600E+ User Guide, ARM.
