HARDWARE OPTIMIZED DCT-IDCT IMPLEMENTATION ON VERILOG HDL

RAHUL SRIKUMAR
ECE734: VLSI ARRAY STRUCTURES FOR DSP
05/10/13

Contents
• Algorithm
• Implementations
• Performance
• Results
• Conclusion
• Future Work

Algorithm
• 8-point DCT
• 2D DCT = C * X * Transpose(C)
• C – coefficient matrix

C =
   91    91    91    91    91    91    91    91
  126   106    71    25   -25   -71  -106  -126
  118    49   -49  -118  -118   -49    49   118
  106   -25  -126   -71    71   126    25  -106
   91   -91   -91    91    91   -91   -91    91
   71  -126    25   106  -106   -25   126   -71
   49  -118   118   -49   -49   118  -118    49
   25   -71   106  -126   126  -106    71   -25
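The slide does not say how the coefficient values were derived. As a hedged illustration, the sketch below assumes they are the orthonormal 8-point DCT-II basis scaled by 256 and rounded to integers; the function name `dct_matrix` and the scale factor are assumptions for this example, not taken from the presentation.

```python
import math


def dct_matrix(n=8, scale=256):
    """Build an n x n DCT-II matrix, scaled and rounded to integers.

    Assumption: orthonormal DCT-II basis multiplied by `scale`.
    """
    rows = []
    for k in range(n):
        # Row 0 uses the DC normalisation sqrt(1/n); other rows use sqrt(2/n).
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([
            round(scale * norm * math.cos((2 * j + 1) * k * math.pi / (2 * n)))
            for j in range(n)
        ])
    return rows


C = dct_matrix()
for row in C:
    print(" ".join(f"{v:5d}" for v in row))
```

Under that assumption the generated rows can be compared entry by entry with the matrix on the slide.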

Algorithm (Cont’d)
• 1D DCT = C * X
• 2D DCT = (1D DCT) * Transpose(C)
• 1D IDCT = Transpose(C) * (2D DCT)
• 2D IDCT = (1D IDCT) * C
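A minimal sketch of the two-pass computation, following the definition 2D DCT = C * X * Transpose(C) from the Algorithm slide and its inverse X = Transpose(C) * Y * C. The helper names (`matmul`, `transpose`, `descale`) and the rounding right-shift by 8 bits after each pass are assumptions made for illustration; the presentation does not describe its scaling, and the 11-bit and 9-bit word lengths are not modelled here.

```python
# Integer coefficient matrix from the Algorithm slide.
C = [
    [91, 91, 91, 91, 91, 91, 91, 91],
    [126, 106, 71, 25, -25, -71, -106, -126],
    [118, 49, -49, -118, -118, -49, 49, 118],
    [106, -25, -126, -71, 71, 126, 25, -106],
    [91, -91, -91, 91, 91, -91, -91, 91],
    [71, -126, 25, 106, -106, -25, 126, -71],
    [49, -118, 118, -49, -49, 118, -118, 49],
    [25, -71, 106, -126, 126, -106, 71, -25],
]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def descale(m, shift=8):
    """Divide by 2**shift with rounding (assumed scaling, not from the slides)."""
    half = 1 << (shift - 1)
    return [[(v + half) >> shift for v in row] for row in m]


def dct_2d(x):
    stage1 = descale(matmul(C, x))                 # 1D DCT = C * X
    return descale(matmul(stage1, transpose(C)))   # then multiply by Transpose(C)


def idct_2d(y):
    stage1 = descale(matmul(transpose(C), y))      # 1D IDCT = Transpose(C) * Y
    return descale(matmul(stage1, C))              # then multiply by C


block = [[100] * 8 for _ in range(8)]
coeffs = dct_2d(block)
restored = idct_2d(coeffs)
```

Because the coefficients are rounded integers, the round trip is only approximate; a flat block concentrates all energy in the DC coefficient and is restored to within a few least significant bits.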

Implementations Part 1
• Input word length – 8 bits
• 1D DCT internal word length – 11 bits
• 2D DCT output word length – 9 bits
• 2D IDCT output word length – 8 bits
• 4 implementations were evaluated

1. Serial In (SI) – 1 pixel at a time

2. 2 Parallel In (2PI) – 2 pixels at a time

3. 4 Parallel In (4PI) – 4 pixels at a time

4. 8 Parallel In (8PI) – 8 pixels at a time

Implementations Part 2

• 8 registers of 8 bits each are used for coefficient storage.
• This is very efficient compared to the 64 registers required for an 8*8 DCT/IDCT computation.
• 2 RAMs, each with 64 locations (8 bits wide), are used.
• The RAMs are enabled in the order:

en_ram1_write->(en_ram1_read, en_ram2_write)->en_ram2_read
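The enable order above can be modelled in software as a three-phase schedule. The sketch below is only an assumption about how the two 64-location RAMs might be used: it supposes RAM 1 is written row by row and read column by column (a transposition buffer), which the slides do not state. The function `run_block`, the `second_stage` callback, and the addressing scheme are all hypothetical.

```python
N = 8  # 8 x 8 block, so each RAM has 64 locations


def run_block(rows, second_stage=lambda column: column):
    """Model one block passing through the two RAMs; returns (result, trace)."""
    ram1 = [0] * (N * N)
    ram2 = [0] * (N * N)
    trace = []

    # Phase 1: en_ram1_write, first-pass results stored row by row.
    trace.append(("en_ram1_write",))
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            ram1[r * N + c] = value

    # Phase 2: en_ram1_read and en_ram2_write are active together.
    # Assumption: RAM 1 is read column-wise so the second pass sees the transpose.
    trace.append(("en_ram1_read", "en_ram2_write"))
    for c in range(N):
        column = [ram1[r * N + c] for r in range(N)]
        for r, value in enumerate(second_stage(column)):
            ram2[c * N + r] = value

    # Phase 3: en_ram2_read, final results streamed out.
    trace.append(("en_ram2_read",))
    result = [ram2[i * N:(i + 1) * N] for i in range(N)]
    return result, trace


rows = [[r * N + c for c in range(N)] for r in range(N)]
result, trace = run_block(rows)
```

With the default identity `second_stage`, the model simply returns the transpose of its input, which makes the assumed addressing easy to check.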

Performance 1

1. Serial In (1 pixel at a time)
• Read 8 inputs = 8 cycles
• Register 8 inputs + sign extension = 1 cycle
• Add/Sub = 1 cycle
• Absolute value = 1 cycle
• Multiplication = 1 cycle
• Final addition = 2 cycles
• Total = 14 cycles
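The stage names above (Add/Sub, Absolute value, Multiplication, Final addition) are consistent with a common way of computing an 8-point DCT that exploits the symmetry of the coefficient rows. The sketch below is one possible reading, not the author's confirmed datapath: it assumes mirrored inputs are first added or subtracted, operands are split into sign and magnitude, magnitudes are multiplied, and four products are summed in a two-level adder tree. The function names are hypothetical.

```python
# Integer coefficient matrix from the Algorithm slide.
C = [
    [91, 91, 91, 91, 91, 91, 91, 91],
    [126, 106, 71, 25, -25, -71, -106, -126],
    [118, 49, -49, -118, -118, -49, 49, 118],
    [106, -25, -126, -71, 71, 126, 25, -106],
    [91, -91, -91, 91, 91, -91, -91, 91],
    [71, -126, 25, 106, -106, -25, 126, -71],
    [49, -118, 118, -49, -49, 118, -118, 49],
    [25, -71, 106, -126, 126, -106, 71, -25],
]


def dct_1d_staged(x):
    """8-point DCT of x, organised as the four stages named on the slide."""
    # Add/Sub: even rows are symmetric, odd rows antisymmetric about the centre.
    sums = [x[j] + x[7 - j] for j in range(4)]
    diffs = [x[j] - x[7 - j] for j in range(4)]
    out = []
    for k in range(8):
        operands = sums if k % 2 == 0 else diffs
        products = []
        for j in range(4):
            # Absolute value: separate the sign from the magnitudes.
            negative = (operands[j] < 0) != (C[k][j] < 0)
            # Multiplication: unsigned magnitude multiply, sign restored after.
            magnitude = abs(operands[j]) * abs(C[k][j])
            products.append(-magnitude if negative else magnitude)
        # Final addition: two-level adder tree (two cycles in the slide's count).
        out.append((products[0] + products[1]) + (products[2] + products[3]))
    return out


def dct_1d_direct(x):
    """Reference: full 8 x 8 matrix-vector product."""
    return [sum(C[k][j] * x[j] for j in range(8)) for k in range(8)]


sample = [12, -7, 100, 3, 0, -128, 55, 127]
staged = dct_1d_staged(sample)
direct = dct_1d_direct(sample)
```

The staged version needs four multiplications per output instead of eight, which is the usual motivation for the add/subtract stage.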

Performance 2

2. 2 Parallel In (2 pixels at a time)
• Register 8 inputs + sign extension = 4 cycles
• Add/Sub = 1 cycle
• Absolute value = 1 cycle
• Multiplication = 1 cycle
• Final addition = 2 cycles
• Total = 9 cycles

Performance 3

3. 4 Parallel In (4 pixels at a time)
• Register 8 inputs + sign extension = 2 cycles
• Add/Sub = 1 cycle
• Absolute value = 1 cycle
• Multiplication = 1 cycle
• Final addition = 2 cycles
• Total = 7 cycles

Performance 4

4. 8 Parallel In (8 pixels at a time)
• Register 8 inputs + sign extension = 1 cycle
• Add/Sub = 1 cycle
• Absolute value = 1 cycle
• Multiplication = 1 cycle
• Final addition = 2 cycles
• Total = 6 cycles
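The four cycle breakdowns on the Performance slides follow one pattern: a load phase that depends on how many pixels arrive per cycle, followed by a fixed five-cycle compute pipeline. The small sketch below tabulates this; the function name `dct_1d_cycles` is hypothetical, and the per-stage figures are copied from the slides.

```python
# Fixed compute stages and their cycle counts, as listed on the slides.
COMPUTE_STAGES = {
    "add_sub": 1,
    "absolute_value": 1,
    "multiplication": 1,
    "final_addition": 2,
}


def dct_1d_cycles(pixels_per_cycle):
    """Cycles for one 8-point transform with the given input parallelism."""
    compute = sum(COMPUTE_STAGES.values())
    if pixels_per_cycle == 1:
        # Serial In: 8 cycles to read the inputs, then 1 cycle to register
        # them with sign extension.
        load = 8 + 1
    else:
        # Parallel In: inputs are registered as they arrive.
        load = 8 // pixels_per_cycle
    return load + compute


totals = {p: dct_1d_cycles(p) for p in (1, 2, 4, 8)}
for p, cycles in totals.items():
    print(f"{p} pixel(s) per cycle: {cycles} cycles")
```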

Synthesis

• Target platform: Altera Cyclone IV GX FPGA
• Tool used: Quartus II
• Language used: Verilog

Results 1

[Figure: bar chart "Combinational Blocks" comparing 8 Parallel, 4 Parallel In, 2 Parallel In and Serial In; vertical axis from 5600 to 6400]

• Serial In has the lowest synthesized combinational area because it needs the fewest wires to feed in the data.

Results 2

• Serial In has the lowest synthesized area because it requires the fewest storage elements and counters to process the data.

[Figure: bar chart "Registers" comparing 8 Parallel, 4 Parallel In, 2 Parallel In and Serial In; vertical axis from 4520 to 4720]

Results 3

• 8 Parallel In takes 236 cycles, compared with 246 cycles for Serial In.

[Figure: bar chart "Total Computation Time" (cycles to 2D IDCT of 8*8 block) comparing 8 Parallel, 4 Parallel In, 2 Parallel In and Serial In; vertical axis from 230 to 246]

Conclusion

• Serial In occupies ~6% less area than 8 Parallel In, while its performance degradation is comparatively smaller (~4%).
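The ~4% figure can be checked against the cycle counts reported on the Results 3 slide (246 cycles for Serial In, 236 for 8 Parallel In). The snippet below is a simple arithmetic sketch; the exact area values are not recoverable from this transcript, so the ~6% area figure is not recomputed here.

```python
serial_in_cycles = 246      # from the Results 3 slide
parallel8_cycles = 236      # from the Results 3 slide

# Extra cycles needed by Serial In, relative to 8 Parallel In.
slowdown_percent = (serial_in_cycles - parallel8_cycles) / parallel8_cycles * 100
print(f"Serial In is about {slowdown_percent:.1f}% slower than 8 Parallel In")
```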
