A Simple JPEG Encoder With CUDA Technology
Dongyue Mou and Zeng Xing
cujpeg
Jan 16, 2016
Outline
• JPEG Algorithm
• Traditional Encoder
• What's new in cujpeg
• Benchmark
• Conclusion
JPEG Algorithm
JPEG is a commonly used method for image compression.
The JPEG encoding algorithm consists of 7 steps:
1. Divide the image into 8x8 blocks
2. [R,G,B] to [Y,Cb,Cr] conversion
3. Downsampling (optional)
4. FDCT (Forward Discrete Cosine Transform)
5. Quantization
6. Serialization in zig-zag order
7. Entropy encoding (Run Length Coding & Huffman coding)
JPEG Algorithm -- Example
This is an example
Divide into 8x8 blocks
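The block division can be sketched in plain C as follows. This is a minimal illustration, not cujpeg's code: the names `extract_block` and `blocks_in_image` are hypothetical, the image is assumed to be a single channel stored row by row, and its width and height are assumed to be multiples of 8.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the 8x8 block at block coordinates (bx, by) out of a single-channel
 * image stored row by row. Assumes width and height are multiples of 8. */
void extract_block(const unsigned char *image, int width, int height,
                   int bx, int by, unsigned char block[64])
{
    assert(width % 8 == 0 && height % 8 == 0);
    assert(bx >= 0 && bx < width / 8 && by >= 0 && by < height / 8);
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            block[y * 8 + x] =
                image[(size_t)(by * 8 + y) * (size_t)width + (size_t)(bx * 8 + x)];
}

/* Number of 8x8 blocks per channel in an image of the given size. */
int blocks_in_image(int width, int height)
{
    assert(width % 8 == 0 && height % 8 == 0);
    return (width / 8) * (height / 8);
}
```

For a 4096x4096 image, `blocks_in_image` gives 512 * 512 = 262144 blocks per channel.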
RGB vs. YCC
The human eye is less sensitive to a loss of color precision than to a loss of contour precision (contours are carried by luminance)
Color space conversion makes use of it!
Simple color space model: [R,G,B] per pixel
JPEG uses [Y, Cb, Cr] Model
Y = Brightness
Cb = Color blueness
Cr = Color redness
Convert RGB to YCC
8x8 pixels; 1 pixel = 3 components
MCU with sampling factor (1, 1, 1)
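A per-pixel conversion can be sketched in C as below. It assumes the usual JFIF coefficients with 8-bit results, where Cb and Cr are offset by 128 so that they fit in [0, 255]; the type `YCbCr` and the helper `clamp_u8` are hypothetical names used only for this illustration.

```c
#include <assert.h>

typedef struct { unsigned char y, cb, cr; } YCbCr;

/* Round to nearest and clamp to the 8-bit range. */
static unsigned char clamp_u8(float v)
{
    if (v < 0.0f) return 0;
    if (v > 255.0f) return 255;
    return (unsigned char)(v + 0.5f);
}

/* Convert one pixel from [R,G,B] to [Y,Cb,Cr] (JFIF coefficients). */
YCbCr rgb_to_ycbcr(unsigned char r, unsigned char g, unsigned char b)
{
    YCbCr out;
    assert(sizeof(out) == 3); /* three 8-bit components per pixel */
    out.y  = clamp_u8( 0.299f    * r + 0.587f    * g + 0.114f    * b);
    out.cb = clamp_u8(-0.168736f * r - 0.331264f * g + 0.5f      * b + 128.0f);
    out.cr = clamp_u8( 0.5f      * r - 0.418688f * g - 0.081312f * b + 128.0f);
    return out;
}
```

Gray pixels (R = G = B) map to Cb = Cr = 128, which is why the chroma planes of a mostly gray image compress so well.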
Downsampling
MCU (minimum coded unit): the smallest group of data units that is coded.
The data size is reduced to half immediately
4 blocks = 16x16 pixels
Y is taken for every pixel, and Cb, Cr are taken once per block of 2x2 pixels
MCU with sampling factor (2, 1, 1)
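The halving can be checked with a short C sketch. Averaging each 2x2 group is an assumption here (the slide only says that one Cb and one Cr value is taken per 2x2 pixels), and `downsample_2x2` and `samples_per_mcu` are hypothetical names.

```c
#include <assert.h>

/* Reduce a chroma plane to half resolution in both directions by averaging
 * each 2x2 group. dst must hold (width / 2) * (height / 2) values. */
void downsample_2x2(const unsigned char *src, int width, int height,
                    unsigned char *dst)
{
    assert(width % 2 == 0 && height % 2 == 0);
    int out_w = width / 2;
    for (int y = 0; y < height; y += 2)
        for (int x = 0; x < width; x += 2) {
            int sum = src[y * width + x]       + src[y * width + x + 1]
                    + src[(y + 1) * width + x] + src[(y + 1) * width + x + 1];
            dst[(y / 2) * out_w + x / 2] = (unsigned char)((sum + 2) / 4);
        }
}

/* Samples stored for a 16x16-pixel area: Y at full resolution, Cb and Cr
 * either at full resolution or downsampled to 8x8 each. */
int samples_per_mcu(int downsampled)
{
    int luma = 16 * 16;
    int chroma = downsampled ? 8 * 8 : 16 * 16;
    return luma + 2 * chroma;
}
```

Without downsampling a 16x16 area holds 768 samples; with it, 256 + 64 + 64 = 384, which is exactly half.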
Apply FDCT
2D IDCT:
1D IDCT:
2-D is equivalent to 1-D applied in each direction
Kernel uses 1-D transforms
Bottleneck: evaluated directly, the 2-D transform has complexity O(n^4)
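For reference, the standard 8-point forward DCT (DCT-II, orthonormal scaling) can be written as below. These are the textbook definitions, given here as an assumption about the normalization in use; the last line shows why the 2-D transform can be computed as 1-D transforms along each direction.

```latex
% 1-D forward DCT of 8 samples x_0 .. x_7
X_k = \frac{c_k}{2} \sum_{n=0}^{7} x_n \cos\frac{(2n+1)k\pi}{16},
\qquad c_0 = \tfrac{1}{\sqrt{2}}, \quad c_k = 1 \ (k > 0)

% 2-D forward DCT of an 8x8 block f(x, y)
F(u,v) = \frac{c_u c_v}{4} \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y)
         \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}

% Separability: 1-D transform along x, then 1-D transform along y
F(u,v) = \frac{c_v}{2} \sum_{y=0}^{7}
         \left[ \frac{c_u}{2} \sum_{x=0}^{7} f(x,y) \cos\frac{(2x+1)u\pi}{16} \right]
         \cos\frac{(2y+1)v\pi}{16}
```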
Apply FDCT
Shift operations
From [0, 255]
To [-128, 127]
DCT Result
Meaning of each position in the DCT result matrix
Quantization
DCT result
Quantization matrix (adjustable according to quality)
Quantization result
Zigzag reordering / Run Length Coding
[number of zeros before this value, this value]
Quantization result
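Zigzag reordering and run-length coding of the 63 AC coefficients can be sketched in C as below. The zigzag table is the standard JPEG scan order; the type `RunValue` and the function `zigzag_rle_ac` are hypothetical names, the DC coefficient is left out, and EOB is represented here as the pair [0, 0].

```c
#include <assert.h>
#include <stddef.h>

/* ZIGZAG[i] = row-major index of the i-th coefficient in zigzag order. */
static const unsigned char ZIGZAG[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

typedef struct { int run; int value; } RunValue;

/* Run-length code the AC coefficients of a quantized block in zigzag order.
 * Runs longer than 15 are split into [15, 0] pairs, each standing for 16
 * zeros. Trailing zeros become one EOB pair [0, 0].
 * out must hold at least 64 entries; returns the number of pairs written. */
int zigzag_rle_ac(const int block[64], RunValue *out)
{
    assert(block != NULL && out != NULL);
    int n = 0, run = 0;
    for (int i = 1; i < 64; i++) {
        int v = block[ZIGZAG[i]];
        if (v == 0) { run++; continue; }
        while (run > 15) {
            out[n].run = 15; out[n].value = 0; n++;
            run -= 16;
        }
        out[n].run = run; out[n].value = v; n++;
        run = 0;
    }
    if (run > 0) { out[n].run = 0; out[n].value = 0; n++; } /* EOB */
    return n;
}
```

A block whose only non-zero AC coefficients are -3, 12 and 3 in the first three zigzag positions yields [0, -3] [0, 12] [0, 3] followed by EOB.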
Huffman encoding
RLC result: [0, -3] [0, 12] [0, 3] ...... EOB
After group number added: [0, 2, 00b] [0, 4, 1100b] [0, 2, 11b] ...... EOB
First Huffman coding (i.e. for [0, 2, 00b]): [0, 2, 00b] => [100b, 00b] (looked up, e.g., in the AC chrominance table)
Total input: 512 bits; output: 113 bits
Values                          G     Real saved values
0                               0     (none)
-1, 1                           1     0, 1
-3, -2, 2, 3                    2     00, 01, 10, 11
-7, -6, -5, -4, 4, 5, 6, 7      3     000, 001, 010, 011, 100, 101, 110, 111
...                             4     ...
...                             5     ...
...                             ...   ...
-32767..-16384, 16384..32767    15    ...
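The group number G and the saved bits of a value can be computed as in the C sketch below, which follows the standard JPEG rule: G is the bit length of the magnitude, positive values are stored as they are, and negative values are stored as value + 2^G - 1. The function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Group number G: the number of bits needed for |value| (0 for value 0). */
int group_number(int value)
{
    int magnitude = abs(value);
    int g = 0;
    while (magnitude > 0) { g++; magnitude >>= 1; }
    return g;
}

/* The G bits actually saved for a value, returned as an unsigned number.
 * Only the lowest G bits are meaningful. */
unsigned saved_bits(int value)
{
    int g = group_number(value);
    assert(g <= 15);
    if (value >= 0)
        return (unsigned)value;
    return (unsigned)(value + (1 << g) - 1);
}
```

For example, -3 is in group 2 and is saved as 00b, 3 is in group 2 and is saved as 11b, and 12 is in group 4 and is saved as 1100b.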
Traditional Encoder
Image -> CPU (Load image -> Color conversion -> DCT -> Quantization -> Zigzag Reorder -> Encoding) -> .jpg
What's new in cujpeg
Algorithm Analysis
1x full 2D DCT scan: O(N^4)
8x row 1D DCT scan + 8x column 1D DCT scan: O(N^3)
8 threads can work in parallel
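The two complexity figures can be made concrete by counting multiply-add terms for an N x N block. This is a rough sketch with hypothetical function names; it counts the terms of the plain definitions and ignores the savings of a fast 1-D DCT.

```c
#include <assert.h>

/* Direct 2-D DCT: N*N outputs, each a sum over N*N inputs. */
long direct_2d_terms(int n)
{
    assert(n > 0);
    return (long)n * n * n * n;
}

/* Row/column method: N row transforms plus N column transforms,
 * each producing N outputs from N inputs. */
long separable_terms(int n)
{
    assert(n > 0);
    return 2L * n * n * n;
}
```

For N = 8 this gives 4096 terms for the direct scan and 1024 for the row/column method. The 8 row transforms are independent of each other, as are the 8 column transforms, which is what allows 8 threads to work in parallel.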
DCT In Place
// In-place 8-point 1-D DCT on Vect0[0], Vect0[Step], ..., Vect0[7 * Step].
// C_a..C_f and C_norm are constants defined elsewhere (not shown here); for an
// orthonormal DCT they would be sqrt(2) * cos(k * pi / 16) for k = 1, 2, 3, 5, 6, 7
// and 1 / sqrt(8).
__device__ void vectorDCTInPlace(float *Vect0, int Step)
{
    float *Vect1 = Vect0 + Step, *Vect2 = Vect1 + Step;
    float *Vect3 = Vect2 + Step, *Vect4 = Vect3 + Step;
    float *Vect5 = Vect4 + Step, *Vect6 = Vect5 + Step;
    float *Vect7 = Vect6 + Step;

    // Butterfly stage: sums and differences of mirrored elements.
    float X07P = (*Vect0) + (*Vect7);
    float X16P = (*Vect1) + (*Vect6);
    float X25P = (*Vect2) + (*Vect5);
    float X34P = (*Vect3) + (*Vect4);
    float X07M = (*Vect0) - (*Vect7);
    float X61M = (*Vect6) - (*Vect1);
    float X25M = (*Vect2) - (*Vect5);
    float X43M = (*Vect4) - (*Vect3);
    float X07P34PP = X07P + X34P;
    float X07P34PM = X07P - X34P;
    float X16P25PP = X16P + X25P;
    float X16P25PM = X16P - X25P;

    // Even-indexed outputs come from the sums, odd-indexed ones from the differences.
    (*Vect0) = C_norm * (X07P34PP + X16P25PP);
    (*Vect2) = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
    (*Vect4) = C_norm * (X07P34PP - X16P25PP);
    (*Vect6) = C_norm * (C_e * X07P34PM - C_b * X16P25PM);
    (*Vect1) = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
    (*Vect3) = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
    (*Vect5) = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
    (*Vect7) = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
}

// Sequential 8x8 DCT: transform every row (Step = 1), then every column (Step = 8).
__device__ void blockDCTInPlace(float *block)
{
    for (int row = 0; row < 64; row += 8)
        vectorDCTInPlace(block + row, 1);
    for (int col = 0; col < 8; col++)
        vectorDCTInPlace(block + col, 8);
}

// Parallel 8x8 DCT: 8 threads per block, each transforms one row, then one column.
__device__ void parallelDCTInPlace(float *block)
{
    int col = threadIdx.x % 8;
    int row = col * 8;
    __syncthreads();
    vectorDCTInPlace(block + row, 1);   // row pass
    __syncthreads();                    // all rows must be done before the column pass
    vectorDCTInPlace(block + col, 8);   // column pass
    __syncthreads();
}
Allocation
Desktop PC
– CPU: 1 P4 core, 3.0GHz
– RAM: 2GB
Graphic Card
– GPU: 16 cores, 575MHz; 8 SP/core, 1.35GHz
– RAM: 768MB
Binding
Huffman Encoding (stays on the CPU)
• many conditions/branches
• intensive bit operations
• little computation
Color conversion, DCT, Quantize (moved to the GPU)
• intensive computation
• few conditions/branches
Binding
Hardware: 16KB shared memory
Problem: 1 MCU contains 768 bytes of data (3 blocks x 64 floats x 4 bytes)
Result: maximal 21 MCUs/CUDA block
Hardware: 512 threads
Problem: 1 MCU contains 3 blocks, 1 block needs 8 threads
Result: 1 MCU needs 24 threads
1 CUDA block = 21 MCUs = 504 threads
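The block size follows from two hardware limits, as the C sketch below works through. It assumes each MCU is held in shared memory as 3 blocks of 64 floats; the limits are the ones stated above, and all names are hypothetical.

```c
#include <assert.h>

enum {
    SHARED_MEM_BYTES      = 16 * 1024, /* shared memory per CUDA block */
    MAX_THREADS_PER_BLOCK = 512,
    BLOCKS_PER_MCU        = 3,         /* Y, Cb, Cr */
    VALUES_PER_BLOCK      = 64,        /* 8x8 */
    THREADS_PER_8X8_BLOCK = 8          /* one thread per row/column */
};

int mcu_bytes(void)
{
    return BLOCKS_PER_MCU * VALUES_PER_BLOCK * (int)sizeof(float);
}

int threads_per_mcu(void)
{
    return BLOCKS_PER_MCU * THREADS_PER_8X8_BLOCK;
}

/* MCUs per CUDA block: the tighter of the memory and the thread limit. */
int mcus_per_cuda_block(void)
{
    int by_memory  = SHARED_MEM_BYTES / mcu_bytes();
    int by_threads = MAX_THREADS_PER_BLOCK / threads_per_mcu();
    assert(by_memory > 0 && by_threads > 0);
    return by_memory < by_threads ? by_memory : by_threads;
}
```

Both limits allow 21 MCUs, which gives 21 * 24 = 504 threads and 21 * 768 = 16128 bytes of shared memory per CUDA block.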
cujpeg Encoder
Image -> CPU (Load image) -> GPU (Color conversion -> DCT -> Quantization -> Zigzag Reorder) -> CPU (Encoding) -> .jpg
cujpeg Encoder
Data flow: Image -> Host Memory -> Texture Memory -> Shared Memory (Color Conversion, In-Place DCT, Quantize/Reorder) -> Global Memory (Quantization/Reorder Result) -> Host Memory -> CPU (Encoding) -> .jpg

Load image (host memory to texture memory) and allocate the result buffer:
cudaMallocArray(&textureCache, &channel, scanlineSize, imgHeight);
cudaMemcpy2DToArray(textureCache, 0, 0, image, imageStride, imageWidth, imageHeight, cudaMemcpyHostToDevice);
cudaBindTextureToArray(TexSrc, textureCache, channel);
cudaMalloc((void **)(&ResultDevice), ResultSize);

Color conversion (texture memory to shared memory):
int b = tex2D(TexSrc, TexPosX++, TexPosY);
int g = tex2D(TexSrc, TexPosX++, TexPosY);
int r = tex2D(TexSrc, TexPosX += 6, TexPosY);
float y  =  0.299 * r + 0.587 * g + 0.114 * b - 128.0 + 0.5;
float cb = -0.168736 * r - 0.331264 * g + 0.500 * b + 0.5;
float cr =  0.500 * r - 0.418688f * g - 0.081312 * b + 0.5;
myDCTLine[Offset + i]       = y;
myDCTLine[Offset + 64 + i]  = cb;
myDCTLine[Offset + 128 + i] = cr;

Quantize and reorder (shared memory to global memory):
for (int i = 0; i < BLOCK_WIDTH; i++)
    myDestBlock[myZLine[i]] = (int)(myDCTLine[i] * myDivQLine[i] + 0.5f);

Copy the result back (global memory to host memory):
cudaMemcpy(ResultHost, ResultDevice, ResultSize, cudaMemcpyDeviceToHost);
Scheduling
For each MCU:
• 24 threads: convert 2 pixels each
• 8 threads: convert the remaining 2 pixels each
• 24 threads: do 1x row vector DCT, do 1x column vector DCT, quantize 8x scalar values
Data stages (each handled by x24 threads): RGB Data -> YCC Block (Y, Cb, Cr) -> DCT Block (Y, Cb, Cr) -> Quantized/Reordered Data
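One possible mapping from a thread index to its share of the work is sketched below in C. The mapping itself is an assumption made for illustration (only the line index matches the `threadIdx.x % 8` used in `parallelDCTInPlace`), and the names `Work`, `dct_work_for_thread`, and `pixels_converted` are hypothetical.

```c
#include <assert.h>

enum { THREADS_PER_MCU = 24, THREADS_PER_COMPONENT = 8 };

typedef struct {
    int mcu;        /* which MCU inside the CUDA block */
    int component;  /* 0 = Y, 1 = Cb, 2 = Cr */
    int line;       /* row in the row pass, column in the column pass */
} Work;

/* Assumed mapping of a thread index (0..503) to its DCT work item. */
Work dct_work_for_thread(int thread_idx)
{
    Work w;
    assert(thread_idx >= 0);
    w.mcu       = thread_idx / THREADS_PER_MCU;
    w.component = (thread_idx % THREADS_PER_MCU) / THREADS_PER_COMPONENT;
    w.line      = thread_idx % THREADS_PER_COMPONENT;
    return w;
}

/* Pixels covered by the two conversion passes: 24 threads x 2 pixels,
 * then 8 threads x 2 pixels. */
int pixels_converted(void)
{
    return 24 * 2 + 8 * 2;
}
```

The two conversion passes together cover 48 + 16 = 64 pixels, i.e. one full 8x8 MCU, and the quantization step covers 24 threads x 8 values = 192 values, i.e. the three 8x8 component blocks.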
GPU Occupancy
[Chart] Varying Register Count: Multiprocessor Warp Occupancy vs. Registers Per Thread (my register count: 16)
[Chart] Varying Block Size: Multiprocessor Warp Occupancy vs. Threads Per Block (my block size: 504)
[Chart] Varying Shared Memory Usage: Multiprocessor Warp Occupancy vs. Shared Memory Per Thread (my shared memory: 16128)
Threads Per Block                          504
Registers Per Thread                       16
Shared Memory Per Block (bytes)            16128
Active Threads per Multiprocessor          504
Active Warps per Multiprocessor            16
Active Thread Blocks per Multiprocessor    1
Occupancy of each Multiprocessor           67%
Maximum Simultaneous Blocks per GPU        16
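The occupancy figure can be reproduced with the small C sketch below. It assumes a warp size of 32 threads and a limit of 24 active warps per multiprocessor, which is the limit for this GPU generation and the top of the occupancy axis in the charts; the function names are hypothetical.

```c
#include <assert.h>

enum { WARP_SIZE = 32, MAX_WARPS_PER_MULTIPROCESSOR = 24 };

/* Warps needed for one CUDA block; a partly filled warp still counts. */
int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Fraction of the multiprocessor's warp slots that are in use. */
double occupancy(int threads_per_block, int blocks_per_multiprocessor)
{
    int active = warps_per_block(threads_per_block) * blocks_per_multiprocessor;
    assert(active <= MAX_WARPS_PER_MULTIPROCESSOR);
    return (double)active / MAX_WARPS_PER_MULTIPROCESSOR;
}
```

With 504 threads per block and one active block per multiprocessor this gives 16 of 24 warps, about 67%. A second block cannot be resident because one block already uses 16128 of the 16384 bytes of shared memory.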
Benchmark
( Q = 80, Sample = 1:1:1 )
           512x512   1024x1024   2048x2048   4096x4096
cujpeg     0.321s    0.376s      0.560s      1.171s
libjpeg    0.121s    0.237s      0.804s      3.971s
Benchmark
Time Consumption (4096x4096)
                 Load     Transfer   Compute   Encode   Total
Quality = 100    0.132s   0.348s     0.043s    0.837s   1.523s
Quality = 80     0.121s   0.324s     0.043s    0.480s   1.123s
Quality = 50     0.130s   0.353s     0.044s    0.468s   1.167s
Time consumption (pie chart): load 10%, transfer 27%, compute 3%, encode 47%, others 13%
Each thread performs 240 operations
24 threads process 1 MCU
A 4096x4096 image contains 262144 MCUs
Total ops: 262144 * 24 * 240 = 1509949440 floating-point operations
Speed: (total ops) / 0.043s = 35.12 GFLOPS
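The throughput estimate can be re-derived with a few lines of C. The inputs (240 operations per thread, 24 threads per MCU, 0.043s compute time) are the figures quoted above; the function names are hypothetical.

```c
#include <assert.h>

/* Number of 8x8-pixel MCUs in an image (sampling factor 1:1:1). */
long long mcu_count(int width, int height)
{
    assert(width % 8 == 0 && height % 8 == 0);
    return (long long)(width / 8) * (height / 8);
}

/* Total operations for the whole image. */
long long total_ops(long long mcus, int threads_per_mcu, int ops_per_thread)
{
    return mcus * threads_per_mcu * ops_per_thread;
}

/* Throughput in GFLOPS for a given compute time in seconds. */
double gflops(long long ops, double seconds)
{
    assert(seconds > 0.0);
    return (double)ops / seconds / 1e9;
}
```

For a 4096x4096 image this gives 262144 MCUs, 1509949440 operations, and roughly 35.1 GFLOPS at 0.043s of compute time.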
Conclusion
CUDA can clearly accelerate JPEG compression, most visibly on large images (4096x4096: 1.171s with cujpeg vs. 3.971s with libjpeg).
The overall performance
• Depends on the system speed
• More bandwidth
• Better encoding routine
• Support for downsampling