04/19/2023

HARDWARE OPTIMIZED DCT-IDCT IMPLEMENTATION ON VERILOG HDL

RAHUL SRIKUMAR

ECE734: VLSI ARRAY STRUCTURES FOR DSP

05/10/13

Contents
• Algorithm
• Implementations
• Performance
• Results
• Conclusion
• Future Work

Algorithm
• 8-point DCT
• 2D DCT = C*X*Transpose(C)
• X – 8*8 input block
• C – coefficient matrix

C =

   91   91   91   91   91   91   91   91
  126  106   71   25  -25  -71 -106 -126
  118   49  -49 -118 -118  -49   49  118
  106  -25 -126  -71   71  126   25 -106
   91  -91  -91   91   91  -91  -91   91
   71 -126   25  106 -106  -25  126  -71
   49 -118  118  -49  -49  118 -118   49
   25  -71  106 -126  126 -106   71  -25
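The integer entries of C are consistent with the 8-point DCT-II basis scaled by 128 and rounded to the nearest integer, with the first (DC) row additionally weighted by 1/sqrt(2). The sketch below regenerates the matrix under that assumption, using the standard DCT-II row order and sign convention. It is a Python software model, not the Verilog implementation; `make_dct_matrix` and its `scale` parameter are illustrative names.

```python
import math

def make_dct_matrix(n_points=8, scale=128):
    """Build an integer DCT-II coefficient matrix (assumed scaling: 128, rounded)."""
    matrix = []
    for k in range(n_points):
        # The k = 0 (DC) row carries an extra 1/sqrt(2) weight in the DCT-II.
        weight = 1 / math.sqrt(2) if k == 0 else 1.0
        row = [
            round(scale * weight
                  * math.cos((2 * n + 1) * k * math.pi / (2 * n_points)))
            for n in range(n_points)
        ]
        matrix.append(row)
    return matrix

C = make_dct_matrix()
for row in C:
    print(" ".join(f"{value:5d}" for value in row))
```

Only seven distinct magnitudes occur (25, 49, 71, 91, 106, 118, 126), and each fits in 8 bits.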

Algorithm (Cont’d)
• 1D DCT = C*X
• 2D DCT = (1D DCT) * Transpose(C)
• 1D IDCT = Transpose(C) * 2D DCT
• 2D IDCT = (1D IDCT) * C
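Starting from 2D DCT = C*X*Transpose(C), the transform can be computed as two 1D passes, and the inverse Transpose(C)*Y*C can be split the same way. The following Python sketch models that row-column decomposition with plain integer arithmetic. It is a software model under stated assumptions, not the Verilog design: the helper names are illustrative, the row order and signs of C follow the standard DCT-II, and the normalization by 2**32 is an assumption based on C being roughly 256 times an orthonormal matrix.

```python
# Coefficient matrix C (standard DCT-II row order and signs assumed).
C = [
    [91, 91, 91, 91, 91, 91, 91, 91],
    [126, 106, 71, 25, -25, -71, -106, -126],
    [118, 49, -49, -118, -118, -49, 49, 118],
    [106, -25, -126, -71, 71, 126, 25, -106],
    [91, -91, -91, 91, 91, -91, -91, 91],
    [71, -126, 25, 106, -106, -25, 126, -71],
    [49, -118, 118, -49, -49, 118, -118, 49],
    [25, -71, 106, -126, 126, -106, 71, -25],
]

def matmul(a, b):
    """Plain integer matrix product a * b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct_2d(x):
    """First pass C*X, second pass (C*X)*Transpose(C)."""
    first_pass = matmul(C, x)
    return matmul(first_pass, transpose(C))

def idct_2d(y):
    """First pass Transpose(C)*Y, second pass (Transpose(C)*Y)*C."""
    first_pass = matmul(transpose(C), y)
    return matmul(first_pass, C)

# Assumed normalization for a forward plus inverse pass (not from the slides).
NORMALIZATION = 2 ** 32

block = [[100] * 8 for _ in range(8)]  # flat test block
coeffs = dct_2d(block)
restored = [[v / NORMALIZATION for v in row] for row in idct_2d(coeffs)]
print(coeffs[0][0], restored[0][0])
```

For a flat block only the DC coefficient is non-zero. Because the rounded coefficients make C only approximately orthogonal, the round trip returns values close to, but not exactly equal to, the input.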

Implementations Part 1
• Input word length – 8 bits
• 1D DCT internal word length – 11 bits
• 2D DCT output word length – 9 bits
• 2D IDCT output word length – 8 bits
• 4 implementations were evaluated:

1. Serial In (SI) – 1 pixel at a time

2. 2 Parallel In (2PI) – 2 pixels at a time

3. 4 Parallel In (4PI) – 4 pixels at a time

4. 8 Parallel In (8PI) – 8 pixels at a time



Implementations Part 2

• 8 registers of 8 bits each are used for coefficient storage.
• This is far fewer than the 64 registers required for an 8*8 DCT/IDCT computation.
• 2 RAMs, each with 64 locations (8 bits wide), are used.
• The RAMs are enabled in the order:

en_ram1_write -> (en_ram1_read, en_ram2_write) -> en_ram2_read
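The enable order can be pictured as three phases, with the middle phase reading RAM 1 while RAM 2 is being written. The Python sketch below models only that sequencing. The enable names come from the slide; the phase length of 64 cycles (one access per RAM location) and the function name `ram_enable_schedule` are assumptions made for illustration.

```python
def ram_enable_schedule(phase_length=64):
    """Yield (cycle, active_enables) following the enable order on the slide.

    phase_length is an assumption: one cycle per RAM location (64 locations).
    """
    phases = [
        ("en_ram1_write",),
        ("en_ram1_read", "en_ram2_write"),  # overlap: drain RAM 1, fill RAM 2
        ("en_ram2_read",),
    ]
    cycle = 0
    for enables in phases:
        for _ in range(phase_length):
            yield cycle, set(enables)
            cycle += 1

schedule = list(ram_enable_schedule())
for cycle, enables in schedule[::64]:
    print(cycle, sorted(enables))
```

In this model neither RAM is ever read and written in the same cycle, which is the property the enable ordering appears to guarantee.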

Performance 1

1. Serial In (1 pixel at a time)
• Read 8 inputs = 8 cycles
• Register 8 inputs + sign extension = 1 cycle
• Add/Sub = 1 cycle
• Absolute value = 1 cycle
• Multiplication = 1 cycle
• Final addition = 2 cycles
• Total = 14 cycles
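The Add/Sub, Multiplication and Final addition stages are consistent with a 1D DCT that exploits the symmetry of C: even rows are symmetric and odd rows are antisymmetric, so each output needs four products instead of eight. The Python sketch below shows that decomposition and checks it against the direct matrix product. This is an assumed interpretation of the stage list, not a transcription of the Verilog; the Absolute value stage is not modeled, and the function names are illustrative.

```python
# Coefficient matrix C (standard DCT-II row order and signs assumed).
C = [
    [91, 91, 91, 91, 91, 91, 91, 91],
    [126, 106, 71, 25, -25, -71, -106, -126],
    [118, 49, -49, -118, -118, -49, 49, 118],
    [106, -25, -126, -71, 71, 126, 25, -106],
    [91, -91, -91, 91, 91, -91, -91, 91],
    [71, -126, 25, 106, -106, -25, 126, -71],
    [49, -118, 118, -49, -49, 118, -118, 49],
    [25, -71, 106, -126, 126, -106, 71, -25],
]

def dct_1d_direct(x):
    """Reference: one full row-by-vector product per output."""
    return [sum(C[k][n] * x[n] for n in range(8)) for k in range(8)]

def dct_1d_butterfly(x):
    """1D DCT using the even/odd symmetry of the rows of C."""
    # Add/Sub stage: pair each sample with its mirror image.
    sums = [x[n] + x[7 - n] for n in range(4)]
    diffs = [x[n] - x[7 - n] for n in range(4)]
    out = []
    for k in range(8):
        operands = sums if k % 2 == 0 else diffs
        # Multiplication stage: four products per output.
        products = [C[k][n] * operands[n] for n in range(4)]
        # Final addition stage: a two-level adder tree.
        out.append((products[0] + products[1]) + (products[2] + products[3]))
    return out

samples = [12, -7, 100, 127, -128, 3, 55, -90]
print(dct_1d_butterfly(samples) == dct_1d_direct(samples))
```

A two-level adder tree over four products would also match the two cycles listed for the final addition, although the slides do not state how that stage is built.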

Performance 2

2. 2 Parallel In (2 pixels at a time)
• Register 8 inputs + sign extension = 4 cycles
• Add/Sub = 1 cycle
• Absolute value = 1 cycle
• Multiplication = 1 cycle
• Final addition = 2 cycles
• Total = 9 cycles
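The totals on the two Performance slides follow directly from the per-stage counts. The short Python sketch below tabulates them; the cycle counts are taken from the slides, while the dictionary layout and the name `total_cycles` are illustrative.

```python
# Per-stage cycle counts as listed on the Performance slides.
STAGES = {
    "Serial In": {
        "read 8 inputs": 8,
        "register + sign extension": 1,
        "add/sub": 1,
        "absolute value": 1,
        "multiplication": 1,
        "final addition": 2,
    },
    "2 Parallel In": {
        "register + sign extension": 4,
        "add/sub": 1,
        "absolute value": 1,
        "multiplication": 1,
        "final addition": 2,
    },
}

def total_cycles(stage_counts):
    """Sum the per-stage cycle counts for one implementation."""
    return sum(stage_counts.values())

totals = {name: total_cycles(counts) for name, counts in STAGES.items()}
for name, cycles in totals.items():
    print(f"{name}: {cycles} cycles")
```

Only the input-loading stage differs between the two variants; the four arithmetic stages contribute the same 5 cycles in both.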

Performance 3


Performance 4


Synthesis

• Target Platform: Altera Cyclone IV GX FPGA
• Tool Used: Quartus II
• Language Used: Verilog

Results 1

[Chart: Combinational Blocks (axis range 5600 to 6400) for 8 Parallel, 4 Parallel In, 2 Parallel In and Serial In]

• Serial In has the lowest synthesized combinational area because it needs the fewest wires to feed in the data.

Results 2

• Serial In has the lowest synthesized register area because it needs the fewest storage elements and counters to process the data.

[Chart: Registers (axis range 4520 to 4720) for 8 Parallel, 4 Parallel In, 2 Parallel In and Serial In]

Results 3

• 8 Parallel In takes 236 cycles, compared with 246 cycles for Serial In.

[Chart: Total Computation Time, in cycles to 2D IDCT of an 8*8 block (axis range 230 to 246), for 8 Parallel, 4 Parallel In, 2 Parallel In and Serial In]

Conclusion

• Serial In occupies ~6% less area than 8 Parallel In, while its performance degradation is comparatively smaller (~4%, 246 vs. 236 cycles).

