cujpeg: A Simple JPEG Encoder With CUDA Technology

Dongyue Mou and Zeng Xing

Jan 16, 2016

Page 1: Dongyue Mou and Zeng Xing

A Simple JPEG Encoder With CUDA Technology

Dongyue Mou and Zeng Xing

cujpeg

Page 2: Dongyue Mou and Zeng Xing

Outline

• JPEG Algorithm

• Traditional Encoder

• What's new in cujpeg

• Benchmark

• Conclusion

Page 4: Dongyue Mou and Zeng Xing

JPEG Algorithm

JPEG is a commonly used method for image compression.

The JPEG encoding algorithm consists of 7 steps:
1. Divide the image into 8x8 blocks
2. [R,G,B] to [Y,Cb,Cr] conversion
3. Downsampling (optional)
4. FDCT (Forward Discrete Cosine Transform)
5. Quantization
6. Serialization in zig-zag order
7. Entropy encoding (Run Length Coding & Huffman coding)

Page 5: Dongyue Mou and Zeng Xing

This is an example

JPEG Algorithm -- Example 

Page 6: Dongyue Mou and Zeng Xing

This is an example

Divide into 8x8 blocks

Page 8: Dongyue Mou and Zeng Xing

RGB vs. YCC

The human eye is less sensitive to a loss of precision in color than to a loss of precision in contours (which are carried by luminance).

Color space conversion makes use of it!

Simple color space model: [R,G,B] per pixel

JPEG uses [Y, Cb, Cr] Model

Y = Brightness

Cb = Color blueness

Cr = Color redness
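The conversion can be sketched in plain C as follows. This is a minimal sketch assuming the full-range JFIF coefficients with chroma centered on 128; the names `YCbCr`, `clamp_u8` and `rgb_to_ycbcr` are illustrative and not taken from cujpeg.

```c
#include <stdint.h>

/* One converted pixel. Hypothetical helper type, not part of cujpeg. */
typedef struct { uint8_t y, cb, cr; } YCbCr;

/* Round to nearest and clamp to the 8-bit range. */
static uint8_t clamp_u8(float v)
{
    if (v < 0.0f) return 0;
    if (v > 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* Full-range JFIF conversion (assumed): Y in [0,255], Cb/Cr centered on 128. */
YCbCr rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b)
{
    YCbCr out;
    out.y  = clamp_u8( 0.299f    * r + 0.587f    * g + 0.114f    * b);
    out.cb = clamp_u8(-0.168736f * r - 0.331264f * g + 0.5f      * b + 128.0f);
    out.cr = clamp_u8( 0.5f      * r - 0.418688f * g - 0.081312f * b + 128.0f);
    return out;
}
```

A gray pixel (r = g = b) maps to Cb = Cr = 128, which is why the chroma channels carry little information for images dominated by contours.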

Page 9: Dongyue Mou and Zeng Xing

Convert RGB to YCC

8x8 pixels, 1 pixel = 3 components

MCU with sampling factor (1, 1, 1)

Page 10: Dongyue Mou and Zeng Xing

Downsampling

MCU (minimum coded unit): the smallest group of data units that is coded.

Y is taken for every pixel, while Cb and Cr are each taken once per block of 2x2 pixels.

The data size is halved immediately: a 16x16-pixel MCU holds 4 Y blocks, 1 Cb block and 1 Cr block (6 blocks instead of 12).

MCU with sampling factor (2, 1, 1)
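A minimal sketch of the chroma downsampling step, assuming each 2x2 neighbourhood is averaged (the slides only say the value is "taken" per 2x2 block, so simply picking one sample is equally possible). The function names are illustrative.

```c
#include <stddef.h>

/* Reduce a 16x16 chroma plane to one 8x8 block by averaging each
 * 2x2 neighbourhood (averaging is an assumption, see lead-in). */
void subsample_chroma_2x2(const float src[16 * 16], float dst[8 * 8])
{
    for (size_t y = 0; y < 8; ++y) {
        for (size_t x = 0; x < 8; ++x) {
            const float *p = &src[(2 * y) * 16 + 2 * x];
            dst[y * 8 + x] = 0.25f * (p[0] + p[1] + p[16] + p[17]);
        }
    }
}

/* Samples stored for one 16x16-pixel area:
 * 12 blocks without downsampling, 4 Y + 1 Cb + 1 Cr blocks with it. */
size_t samples_per_16x16(int downsampled)
{
    return downsampled ? (4 + 1 + 1) * 64 : 12 * 64;
}
```

The second function only restates the block count from the slide: 6 blocks instead of 12, i.e. half the data.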

Page 11: Dongyue Mou and Zeng Xing

Apply FDCT

2D IDCT:

1D IDCT:

2-D is equivalent to 1-D applied in each direction

Kernel uses 1-D transforms

Bottleneck: computed directly, the 2-D transform has complexity O(n^4)

Page 12: Dongyue Mou and Zeng Xing

Apply FDCT

Shift operations

From [0, 255]

To [-128, 127]

DCT Result

Meaning of each position in the DCT result matrix

Page 13: Dongyue Mou and Zeng Xing

Quantization

DCT result

Quantization matrix (adjustable according to quality)

Quantization result
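Quantization divides each DCT coefficient by the matching table entry and rounds to the nearest integer. The sketch below assumes libjpeg-style quality scaling of the table entries; the slides do not say how cujpeg derives its table from the quality setting, so `scale_quant` is an assumption, and both function names are illustrative.

```c
#include <assert.h>

/* Scale one base table entry for a quality setting in 1..100
 * (libjpeg-style rule, assumed here). */
int scale_quant(int base, int quality)
{
    assert(quality >= 1 && quality <= 100);
    int scale = quality < 50 ? 5000 / quality : 200 - 2 * quality;
    int q = (base * scale + 50) / 100;
    if (q < 1) q = 1;
    if (q > 255) q = 255;
    return q;
}

/* Divide every coefficient by its table entry, rounding to nearest. */
void quantize_block(const float dct[64], const int qtable[64], int out[64])
{
    for (int i = 0; i < 64; ++i) {
        float v = dct[i] / (float)qtable[i];
        out[i] = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
    }
}
```

A higher quality setting gives smaller divisors, so fewer coefficients round to zero and the encoder has more data to write.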

Page 14: Dongyue Mou and Zeng Xing

Zigzag reordering / Run Length Coding

Each nonzero coefficient is coded as a pair: [number of zeros preceding it, its value]

Quantization result
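A minimal sketch of both steps: building the zig-zag scan order and turning the reordered AC coefficients into [run, value] pairs. The names are illustrative, and the sketch leaves out the special code real JPEG uses for runs longer than 15 zeros.

```c
/* Fill order[i] with the row-major index of the i-th coefficient
 * in zig-zag order, walking the 15 anti-diagonals of the 8x8 block. */
void build_zigzag(int order[64])
{
    int i = 0;
    for (int s = 0; s < 15; ++s) {           /* s = row + col */
        int lo = s < 8 ? 0 : s - 7;          /* smallest valid row */
        int hi = s < 8 ? s : 7;              /* largest valid row  */
        for (int t = lo; t <= hi; ++t) {
            /* even diagonals run upward, odd diagonals downward */
            int row = (s % 2 == 0) ? (hi + lo - t) : t;
            order[i++] = row * 8 + (s - row);
        }
    }
}

typedef struct { int run; int value; } RunValue;

/* Run-length code the 63 AC coefficients of a zig-zag ordered block.
 * Returns the number of pairs written; trailing zeros produce no pair
 * and are implied by the end-of-block (EOB) marker. */
int run_length_encode(const int zz[64], RunValue out[63])
{
    int n = 0, run = 0;
    for (int i = 1; i < 64; ++i) {
        if (zz[i] == 0) { ++run; continue; }
        out[n].run = run;
        out[n].value = zz[i];
        ++n;
        run = 0;
    }
    return n;
}
```

Because quantization zeroes most high-frequency coefficients, the zig-zag order groups those zeros at the end of the sequence, where a single EOB replaces them.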

Page 15: Dongyue Mou and Zeng Xing

Huffman encoding

RLC result: [0, -3] [0, 12] [0, 3] ...... EOB

After the group number is added: [0, 2, 00b] [0, 4, 1100b] [0, 2, 11b] ...... EOB

First Huffman coding step (e.g. for [0, 2, 00b]): [0, 2, 00b] => [100b, 00b] (looked up in, e.g., the AC chrominance table)

Total input: 512 bits, output: 113 bits

Values                          G (group number)   Real saved values
0                               0                  (none)
-1, 1                           1                  0, 1
-3, -2, 2, 3                    2                  00, 01, 10, 11
-7, -6, -5, -4, 4, 5, 6, 7      3                  000, 001, 010, 011, 100, 101, 110, 111
...                             4                  ...
...                             5                  ...
...                             ...                ...
-32767..-16384, 16384..32767    15                 ...
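The group number and the saved bits follow a simple rule that can be sketched in a few lines. The function names are illustrative; the rule itself is the standard JPEG magnitude-category coding.

```c
/* Group number: the number of bits needed to represent |v|. */
int group_number(int v)
{
    int a = v < 0 ? -v : v;
    int g = 0;
    while (a) { ++g; a >>= 1; }
    return g;
}

/* The bits actually saved, as an unsigned value of group_number(v) bits:
 * positive values are stored as they are, negative values as
 * v + (2^g - 1), so the smallest value of a group maps to all zeros. */
unsigned group_bits(int v)
{
    int g = group_number(v);
    return (unsigned)(v >= 0 ? v : v + (1 << g) - 1);
}
```

For the RLC example, -3 falls in group 2 with bits 00b, 12 in group 4 with bits 1100b, and 3 in group 2 with bits 11b.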

Page 16: Dongyue Mou and Zeng Xing

Outline

• JPEG Algorithm

• Traditional Encoder

• What's new in cujpeg

• Benchmark

• Conclusion

Page 17: Dongyue Mou and Zeng Xing

Traditional Encoder

CPU: Image → Load image → Color conversion → DCT → Quantization → Zigzag reorder → Encoding → .jpg

Page 18: Dongyue Mou and Zeng Xing

Outline

• JPEG Algorithm

• Traditional Encoder

• What's new in cujpeg

• Benchmark

• Conclusion

Page 19: Dongyue Mou and Zeng Xing

Algorithm Analysis

1x full 2D DCT scan: O(N^4)

8x row 1D DCT scan + 8x column 1D DCT scan: O(N^3)

The 8 row (or column) transforms are independent, so 8 threads can work in parallel
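The two complexity figures can be checked with a short calculation. This sketch assumes every output value is computed directly from its cosine sum (N multiplications per 1-D output, N*N per 2-D output); the function names are illustrative.

```c
/* Multiplications for a direct 2-D transform of an NxN block:
 * N*N outputs, each a sum over N*N inputs. */
long direct_2d_mults(int n)
{
    return (long)n * n * n * n;
}

/* Multiplications for the row-column method: 2*N one-dimensional
 * transforms, each with N outputs summed over N inputs. */
long row_column_mults(int n)
{
    return 2L * n * n * n;
}

/* Multiplications per thread when the N transforms of each pass
 * are spread over N threads (one row and one column per thread). */
long row_column_mults_per_thread(int n)
{
    return row_column_mults(n) / n;
}
```

For N = 8 this gives 4096 multiplications for the direct form against 1024 for the row-column form, or 128 per thread when 8 threads share the work.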

Page 20: Dongyue Mou and Zeng Xing

Algorithm Analysis

Page 21: Dongyue Mou and Zeng Xing

DCT In Place

// 8-point 1D DCT, computed in place on Vect0[0], Vect0[Step], ..., Vect0[7 * Step].
// C_a ... C_f and C_norm are constants defined elsewhere in the kernel source.
__device__ void vectorDCTInPlace(float *Vect0, int Step)
{
    float *Vect1 = Vect0 + Step, *Vect2 = Vect1 + Step;
    float *Vect3 = Vect2 + Step, *Vect4 = Vect3 + Step;
    float *Vect5 = Vect4 + Step, *Vect6 = Vect5 + Step;
    float *Vect7 = Vect6 + Step;

    // Butterfly stage: sums and differences of mirrored elements.
    float X07P = (*Vect0) + (*Vect7);
    float X16P = (*Vect1) + (*Vect6);
    float X25P = (*Vect2) + (*Vect5);
    float X34P = (*Vect3) + (*Vect4);
    float X07M = (*Vect0) - (*Vect7);
    float X61M = (*Vect6) - (*Vect1);
    float X25M = (*Vect2) - (*Vect5);
    float X43M = (*Vect4) - (*Vect3);

    float X07P34PP = X07P + X34P;
    float X07P34PM = X07P - X34P;
    float X16P25PP = X16P + X25P;
    float X16P25PM = X16P - X25P;

    // Even-indexed outputs.
    (*Vect0) = C_norm * (X07P34PP + X16P25PP);
    (*Vect2) = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
    (*Vect4) = C_norm * (X07P34PP - X16P25PP);
    (*Vect6) = C_norm * (C_e * X07P34PM - C_b * X16P25PM);

    // Odd-indexed outputs.
    (*Vect1) = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
    (*Vect3) = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
    (*Vect5) = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
    (*Vect7) = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
}

// Serial version: 8 row transforms (stride 1), then 8 column transforms (stride 8).
__device__ void blockDCTInPlace(float *block)
{
    for (int row = 0; row < 64; row += 8)
        vectorDCTInPlace(block + row, 1);
    for (int col = 0; col < 8; col++)
        vectorDCTInPlace(block + col, 8);
}

// Parallel version: each of 8 threads transforms one row, then one column.
__device__ void parallelDCTInPlace(float *block)
{
    int col = threadIdx.x % 8;
    int row = col * 8;

    __syncthreads();
    vectorDCTInPlace(block + row, 1);
    __syncthreads();    // all rows must be finished before the column pass
    vectorDCTInPlace(block + col, 8);
    __syncthreads();
}

Page 22: Dongyue Mou and Zeng Xing

Allocation

Desktop PC
– CPU: 1 P4 core, 3.0 GHz
– RAM: 2 GB

Graphics card
– GPU: 16 cores at 575 MHz, 8 SPs per core at 1.35 GHz
– RAM: 768 MB

Page 23: Dongyue Mou and Zeng Xing

Binding

Huffman encoding
• many conditions/branches
• intensive bit manipulation
• little computation

Color conversion, DCT, quantization
• intensive computation
• few conditions/branches

Page 24: Dongyue Mou and Zeng Xing

Binding

Hardware: 16 KB shared memory
Problem: 1 MCU contains 768 bytes of data (3 blocks x 64 values x 4 bytes)

Result: at most 21 MCUs per CUDA block

Hardware: 512 threads
Problem: 1 MCU contains 3 blocks, 1 block needs 8 threads

Result: 1 MCU needs 24 threads

1 CUDA block = 21 MCUs x 24 threads = 504 threads
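The block sizing can be reproduced with a short calculation. This sketch assumes 768 bytes of shared memory per MCU (3 blocks of 64 float values), which matches the 16128 bytes reported for 21 MCUs in the occupancy figures; `BlockConfig` and `plan_block` are illustrative names.

```c
#include <assert.h>

typedef struct {
    int mcus_per_block;     /* MCUs handled by one CUDA block        */
    int threads_per_block;  /* threads launched per CUDA block       */
    int shared_bytes;       /* shared memory used per CUDA block     */
} BlockConfig;

/* Pick the largest MCU count that fits both the shared memory limit
 * and the thread limit of one CUDA block. */
BlockConfig plan_block(int shared_limit, int thread_limit,
                       int bytes_per_mcu, int threads_per_mcu)
{
    assert(bytes_per_mcu > 0 && threads_per_mcu > 0);
    int by_memory  = shared_limit / bytes_per_mcu;
    int by_threads = thread_limit / threads_per_mcu;
    BlockConfig c;
    c.mcus_per_block    = by_memory < by_threads ? by_memory : by_threads;
    c.threads_per_block = c.mcus_per_block * threads_per_mcu;
    c.shared_bytes      = c.mcus_per_block * bytes_per_mcu;
    return c;
}
```

With a 16384-byte shared memory limit, a 512-thread limit, 768 bytes per MCU and 24 threads per MCU, both limits allow 21 MCUs, giving 504 threads and 16128 bytes per block.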

Page 25: Dongyue Mou and Zeng Xing

cujpeg Encoder

Image → Load image (CPU) → Color conversion → DCT → Quantization → Zigzag reorder (GPU) → Encoding (CPU) → .jpg

Page 26: Dongyue Mou and Zeng Xing

cujpeg Encoder

Data flow: Image → host memory → texture memory → shared memory (color conversion, in-place DCT, quantize/reorder) → global memory (quantization/reorder result) → host memory → encoding on the CPU → .jpg

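// Load image: copy it from host memory into a CUDA array bound to a texture.
cudaMallocArray(&textureCache, &channel, scanlineSize, imgHeight);
cudaMemcpy2DToArray(textureCache, 0, 0, image, imageStride, imageWidth, imageHeight, cudaMemcpyHostToDevice);
cudaBindTextureToArray(TexSrc, textureCache, channel);

// Allocate the result buffer in global memory.
cudaMalloc((void **)(&ResultDevice), ResultSize);

// Kernel, color conversion: read B, G, R from the texture and convert to Y, Cb, Cr.
int b = tex2D(TexSrc, TexPosX++, TexPosY);
int g = tex2D(TexSrc, TexPosX++, TexPosY);
int r = tex2D(TexSrc, TexPosX+=6, TexPosY);
float y  =  0.299*r + 0.587*g + 0.114*b - 128.0 + 0.5;
float cb = -0.168736*r - 0.331264*g + 0.500*b + 0.5;
float cr =  0.500*r - 0.418688f*g - 0.081312*b + 0.5;

// Kernel: store the three components in the shared-memory DCT buffer.
myDCTLine[Offset + i] = y;
myDCTLine[Offset + 64 + i] = cb;
myDCTLine[Offset + 128 + i] = cr;

// Kernel, quantize and reorder: multiply by the reciprocal table entry, round, write in zig-zag order.
for (int i = 0; i < BLOCK_WIDTH; i++)
    myDestBlock[myZLine[i]] = (int)(myDCTLine[i] * myDivQLine[i] + 0.5f);

// Copy the quantized, reordered result back to host memory for encoding.
cudaMemcpy(ResultHost, ResultDevice, ResultSize, cudaMemcpyDeviceToHost);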

Page 27: Dongyue Mou and Zeng Xing

Scheduling

For each MCU:

• 24 threads: convert 2 pixels each
• 8 threads: convert the remaining 2 pixels each
• 24 threads: each does 1 row-vector DCT and 1 column-vector DCT, and quantizes 8 scalar values

Data flow per MCU (24 threads at each stage): RGB data → YCC block (Y, Cb, Cr) → DCT block → quantized/reordered data (Y, Cb, Cr)

Page 28: Dongyue Mou and Zeng Xing

Outline

• JPEG Algorithm

• Traditional Encoder

• What's new in cujpeg

• Benchmark

• Conclusion

Page 29: Dongyue Mou and Zeng Xing

GPU Occupancy

[Chart: Varying Register Count. Multiprocessor warp occupancy vs. registers per thread. My register count: 16]

[Chart: Varying Block Size. Multiprocessor warp occupancy vs. threads per block. My block size: 504]

[Chart: Varying Shared Memory Usage. Multiprocessor warp occupancy vs. shared memory per block. My shared memory: 16128]

Threads Per Block 504

Registers Per Thread 16

Shared Memory Per Block (bytes) 16128

Active Threads per Multiprocessor 504

Active Warps per Multiprocessor 16

Active Thread Blocks per Multiprocessor 1

Occupancy of each Multiprocessor 67%

Maximum Simultaneous Blocks per GPU 16
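The warp count and occupancy figures above can be reproduced as follows. This sketch assumes a warp size of 32 and a limit of 24 active warps per multiprocessor (the upper bound of the chart axis); the function names are illustrative.

```c
#include <assert.h>

/* Warps needed by one block: threads rounded up to whole warps. */
int warps_per_block(int threads_per_block, int warp_size)
{
    assert(warp_size > 0);
    return (threads_per_block + warp_size - 1) / warp_size;
}

/* Blocks that fit on one multiprocessor given its shared memory. */
int blocks_by_shared_memory(int shared_limit, int shared_per_block)
{
    assert(shared_per_block > 0);
    return shared_limit / shared_per_block;
}

/* Occupancy in percent, rounded to the nearest integer. */
int occupancy_percent(int active_warps, int max_warps)
{
    assert(max_warps > 0);
    return (100 * active_warps + max_warps / 2) / max_warps;
}
```

504 threads round up to 16 warps, 16128 bytes of shared memory leave room for only one block per multiprocessor, and 16 of 24 warps is the reported 67%.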

Page 30: Dongyue Mou and Zeng Xing

Benchmark

           512x512   1024x1024   2048x2048   4096x4096
cujpeg     0.321s    0.376s      0.560s      1.171s
libjpeg    0.121s    0.237s      0.804s      3.971s

(Q = 80, Sample = 1:1:1)
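The crossover between the two encoders is easier to see as a ratio. A minimal sketch, with `speedup` as an illustrative helper name and the timings taken from the table above:

```c
#include <assert.h>

/* Ratio of baseline time to candidate time:
 * above 1 the candidate is faster, below 1 it is slower. */
double speedup(double baseline_seconds, double candidate_seconds)
{
    assert(candidate_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}
```

For example, `speedup(3.971, 1.171)` is about 3.4 for the 4096x4096 image, while `speedup(0.121, 0.321)` is about 0.38 for the 512x512 image, where cujpeg is slower than libjpeg.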

Page 31: Dongyue Mou and Zeng Xing

Benchmark

Time Consumption (4096x4096)

                Load     Transfer   Compute   Encode   Total
Quality = 100   0.132s   0.348s     0.043s    0.837s   1.523s
Quality = 80    0.121s   0.324s     0.043s    0.480s   1.123s
Quality = 50    0.130s   0.353s     0.044s    0.468s   1.167s

[Pie chart: Time Consumption. Load 10%, transfer 27%, compute 3%, encode 47%, others 13%]

Page 32: Dongyue Mou and Zeng Xing

Benchmark

Time Consumption (4096x4096)

                Load     Transfer   Compute   Encode   Total
Quality = 100   0.132s   0.348s     0.043s    0.837s   1.523s
Quality = 80    0.121s   0.324s     0.043s    0.480s   1.123s
Quality = 50    0.130s   0.353s     0.044s    0.468s   1.167s

Each thread performs 240 operations

24 threads process 1 MCU

A 4096x4096 image contains 262144 MCUs.

Total ops: 262144 * 24 * 240 = 1509949440 flops

Speed: (total ops) / 0.043 s = 35.12 GFLOPS
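The throughput estimate can be reproduced with a short calculation, assuming 8x8-pixel MCUs (sampling 1:1:1), 24 threads per MCU and 240 operations per thread. The function names are illustrative.

```c
#include <assert.h>

/* Number of 8x8-pixel MCUs in an image (1:1:1 sampling assumed). */
long long mcu_count(int width, int height)
{
    return (long long)(width / 8) * (height / 8);
}

/* Total operations: every MCU is processed by a fixed number of
 * threads, each performing a fixed number of operations. */
long long total_ops(long long mcus, int threads_per_mcu, int ops_per_thread)
{
    return mcus * threads_per_mcu * ops_per_thread;
}

/* Throughput in GFLOPS for a measured compute time in seconds. */
double gflops(long long ops, double seconds)
{
    assert(seconds > 0.0);
    return (double)ops / seconds / 1e9;
}
```

For the 4096x4096 image this gives 262144 MCUs, 1509949440 operations and, with the measured 0.043 s compute time, roughly 35.1 GFLOPS.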

Page 33: Dongyue Mou and Zeng Xing

Outline

• JPEG Algorithm

• Traditional Encoder

• What's new in cujpeg

• Benchmark

• Conclusion

Page 34: Dongyue Mou and Zeng Xing

Conclusion

CUDA can clearly accelerate JPEG compression.

The overall performance
• Depends on the system speed
• More bandwidth
• Better encoding routine
• Support for downsampling