DCT HSRA Implementation Joseph Yeh December 3, 1998

DCT HSRA Implementation

Joseph Yeh

December 3, 1998

Outline

• Introduction / Overview
• Rationale / Prior Art
• Implementation Strategy
• Quality / Capacity
• SCORE
• Power

Overview

• Backbone of Image Compression Standards
• JPEG, MPEG
• still image compression of 8-10:1
• Similarity Transform on Image Data
• B = C A C^T

Focus of Implementation

• 1D DCT
• Needs to be done 16 times on every 8x8 image block
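
The count of 16 follows from the row-column form of the 2D transform B = C A C^T: eight 1D DCTs over the rows of the block, then eight over the columns. Below is a minimal Python sketch of that decomposition, assuming the orthonormal DCT-II matrix for C (the slides do not state a normalization). All function names are illustrative and not taken from the original design.

```python
import math

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix C (normalization is an assumption)."""
    c = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        c.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                  for j in range(n)])
    return c

def dct_1d(x, c):
    """One 1D DCT: the matrix C applied to a single vector."""
    return [sum(c[k][j] * x[j] for j in range(len(x))) for k in range(len(c))]

def dct_2d_row_column(block):
    """2D DCT built from 1D DCTs; returns (B, number of 1D DCTs used)."""
    c = dct_matrix(len(block))
    count = 0
    rows = []
    for row in block:                 # 8 row transforms
        rows.append(dct_1d(list(row), c))
        count += 1
    cols = []
    for col in zip(*rows):            # 8 column transforms
        cols.append(dct_1d(list(col), c))
        count += 1
    b = [list(r) for r in zip(*cols)]  # transpose back to row-major order
    return b, count

def dct_2d_matrix(block):
    """Reference result: B = C A C^T by direct matrix products."""
    n = len(block)
    c = dct_matrix(n)
    ca = [[sum(c[i][k] * block[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[sum(ca[i][k] * c[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Calling `dct_2d_row_column` on any 8x8 block should report 16 one-dimensional transforms and agree with `dct_2d_matrix` to within rounding error.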

Big Picture (JPEG)

Rationale

• Personal Background
• Versatile - Two Mediabench Apps
• Completely Feedforward
• Platform to Compare Wavelets With

Prior Art

• Common ASIC
• Jeffrey Jacob (Toronto) Master Thesis:
– “Memory Interfacing for OneChip Reconfigurable Processor”
– 2D DCT on Altera Flex10K50
• Xilinx 2D DCT implementation

Implementation

• Many fast algorithms exist
• Two provided in Mediabench JPEG
– Loeffler, Ligtenberg, Moschytz
    • jfdctint.c
– Arai, Agui, Nakajima
    • jfdctfst.c
• Both attempted

General Data Flow

Source: Jeffrey Jacob Master Thesis

General Data Flow

19 adds 5 mults 10 adds
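
The 19 adds, 5 mults, 10 adds split can be seen in a floating-point sketch of the Arai-Agui-Nakajima (AAN) flowgraph. The version below follows the well-known structure and constants of that algorithm, regrouped into three stages to match the counts on the slide. It is not the author's HSRA netlist, and jfdctfst.c itself uses fixed-point arithmetic. The outputs are scaled. The helper `aan_scale` (a name introduced here for illustration) gives the per-coefficient factor that is normally folded into quantization.

```python
import math

def aan_dct_1d(d):
    """Scaled 8-point AAN DCT: 19 adds, then 5 multiplies, then 10 adds."""
    # Stage 1: 19 additions/subtractions
    t0, t7 = d[0] + d[7], d[0] - d[7]
    t1, t6 = d[1] + d[6], d[1] - d[6]
    t2, t5 = d[2] + d[5], d[2] - d[5]
    t3, t4 = d[3] + d[4], d[3] - d[4]
    e0, e3 = t0 + t3, t0 - t3
    e1, e2 = t1 + t2, t1 - t2
    out0, out4 = e0 + e1, e0 - e1
    s = e2 + e3
    o0, o1, o2 = t4 + t5, t5 + t6, t6 + t7
    dz = o0 - o2
    # Stage 2: 5 multiplications by constants
    z1 = s * 0.707106781
    z5 = dz * 0.382683433
    m2 = o0 * 0.541196100
    m4 = o2 * 1.306562965
    z3 = o1 * 0.707106781
    # Stage 3: 10 additions/subtractions
    out2, out6 = e3 + z1, e3 - z1
    z2, z4 = m2 + z5, m4 + z5
    z11, z13 = t7 + z3, t7 - z3
    out5, out3 = z13 + z2, z13 - z2
    out1, out7 = z11 + z4, z11 - z4
    return [out0, out1, out2, out3, out4, out5, out6, out7]

def aan_scale(k):
    """Scale factor relating AAN output k to the unscaled DCT sum."""
    return 1.0 if k == 0 else math.sqrt(2.0) * math.cos(k * math.pi / 16)

def reference_dct_1d(x):
    """Orthonormal 8-point DCT-II computed directly, for comparison."""
    out = []
    for k in range(8):
        c = math.sqrt(1.0 / 8) if k == 0 else math.sqrt(2.0 / 8)
        out.append(c * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16)
                           for n in range(8)))
    return out
```

Dividing output k of `aan_dct_1d` by `aan_scale(k) * math.sqrt(8)` should recover the orthonormal coefficient returned by `reference_dct_1d`, up to the precision of the rounded constants.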

Basic Strategy

• Whole design based on Ripple Adders
– multipliers decomposed into Ripple Adders
– simple to manipulate
– placement not restricted by Cascades
– only need to skew at input and output
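
Decomposing a constant multiplier into adders can be illustrated with a plain shift-and-add scheme: each set bit of the fixed-point constant contributes one shifted copy of the input. This is a hedged sketch only. The slides do not say which decomposition or how many fractional bits the design used, so the 8 fractional bits and the function names below are assumptions.

```python
def shift_add_terms(constant, frac_bits):
    """Shift amounts for the set bits of round(constant * 2**frac_bits)."""
    q = round(constant * (1 << frac_bits))
    return [i for i in range(q.bit_length()) if (q >> i) & 1]

def multiply_by_constant(x, constant, frac_bits=8):
    """Multiply integer x by a constant using only shifts and adds.

    The result is scaled by 2**frac_bits, as in fixed-point arithmetic.
    """
    acc = 0
    for shift in shift_add_terms(constant, frac_bits):
        acc += x << shift  # each term would feed one adder in hardware
    return acc

def adder_count(constant, frac_bits=8):
    """Adders needed to sum the shifted terms (set bits minus one)."""
    return max(len(shift_add_terms(constant, frac_bits)) - 1, 0)
```

For example, 0.707106781 with 8 fractional bits quantizes to 181 = 0b10110101, so the product is the sum of five shifted copies of the input and needs four adders. Dropping fractional bits removes terms, which is one way reduced precision could translate into fewer adders.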

Spatial Implementation

Vital Statistics

• Level 11 Array - 2048 BLB’s
• 8*9 = 72 Inputs
• 2*(12)+2*(13)+4*(14) = 106 Outputs
• 547 BLB’s of Logic
– only 26% usage
• Latency: 156 Cycles
• 8-deep retiming will add 284 BLB’s

Speedup

• Feed-forward implies a set of outputs every cycle

• “gcc -O3” compilation of jfdctfst.c
• on Sun Ultrasparc 10
• gives 0.2 seconds for 1048576 1D DCT’s
• 190 ns per DCT => Speedup ratio of 47

Speedup = (250 MHz)*(MP cycles per DCT)/(Rate of MP)
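
The arithmetic behind the ratio can be checked with a short sketch. It reads MP as the microprocessor and assumes the HSRA delivers one 1D DCT per 250 MHz cycle, as the feed-forward bullet implies. The function name and parameters are illustrative.

```python
def speedup(hsra_clock_hz, mp_seconds_total, mp_dct_count, dcts_per_cycle=1):
    """Ratio of microprocessor time per DCT to HSRA time per DCT."""
    mp_seconds_per_dct = mp_seconds_total / mp_dct_count
    hsra_seconds_per_dct = 1.0 / (hsra_clock_hz * dcts_per_cycle)
    return mp_seconds_per_dct / hsra_seconds_per_dct

# Figures quoted on the slide: 0.2 s for 1048576 DCTs, 250 MHz array clock.
mp_ns_per_dct = 0.2 / 1048576 * 1e9
ratio = speedup(250e6, 0.2, 1048576)
```

With the slide's numbers this gives roughly 190 ns per DCT on the microprocessor against 4 ns on the array, a ratio between 47 and 48.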

Quality/Capacity (1)

• Choice of AAN itself a Quality decision

• Estimate of 13 BLB’s saved per bit of precision in mults

• Increased PSNR probably a fluke, but reduced capacity designs definitely worth considering

Precision   BLB's   PSNR (dB)
Base        547     32.40
Base-2      521     32.55
Base-3      503     32.57
Base-4      490     19.40
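
For reference, PSNR figures like those in the table are conventionally computed from the mean squared error between the original and reconstructed pixels. The sketch below assumes 8-bit samples with a peak value of 255. The slides do not state the test image or the exact measurement procedure, so this shows the standard definition rather than the author's script.

```python
import math

def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)
```

A drop from about 32 dB to about 19 dB, as between Base-3 and Base-4, corresponds to roughly a twentyfold increase in mean squared error.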

Quality/Capacity (2)

Base-2 Precision

Quality/Capacity (3)

Base-4 Precision

SCORE

• Think in terms of whole JPEG application

• Reconsider block diagram in greater detail

SCORE: Big Detailed Picture

SCORE: Swapping Scheme

Power Statistics

• HSRA - 668 cycle run
– Activity:
    • LUT Outputs: 0.288
    • System Inputs: 0.232
    • Total: 0.285

Power Statistics

• Ripple Adders
– Most active LUTs in lower significant bits of adders - especially in input array
– Higher order outputs flip sign frequently, causing bit toggling throughout two’s complement representation
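
The sign-flip effect is easy to reproduce: in two’s complement, a small positive value and its negative differ in almost every bit. The sketch below counts toggled bits between consecutive values. The 12-bit width is borrowed from the narrowest output width listed under Vital Statistics, and the helper names are illustrative.

```python
def twos_complement(value, width):
    """Bit pattern of a signed integer in two's complement, as an unsigned int."""
    return value & ((1 << width) - 1)

def toggled_bits(a, b, width):
    """Number of bit positions that change when a signal goes from a to b."""
    return bin(twos_complement(a, width) ^ twos_complement(b, width)).count("1")

def total_toggles(samples, width):
    """Total bit toggles over a sequence of consecutive signal values."""
    return sum(toggled_bits(a, b, width) for a, b in zip(samples, samples[1:]))
```

Going from +1 to -1 in 12 bits toggles 11 of the 12 bits, whereas going from +1 to +2 toggles only 2. A value that hovers around zero therefore switches far more bits than its magnitude suggests.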

Power Statistics

Power Statistics

• HSRA (668 cycle simulation):
– Energy:
    • LUT Outputs: 1095.57 pJ
    • Inputs: 137.29 pJ
    • Clock Energy: 6412800 pJ
        » (300 pJ)(2^(levels-6))(cycles)
    • Total: 6414033 pJ
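
The clock term can be reproduced directly from the formula on the slide. Plugging in the Level 11 array from Vital Statistics and the 668-cycle run quoted on the earlier Power Statistics slide gives exactly the 6412800 pJ listed. The function name is illustrative.

```python
def clock_energy_pj(levels, cycles, base_pj=300):
    """Clock energy in pJ: (300 pJ) * 2^(levels - 6) * cycles."""
    return base_pj * 2 ** (levels - 6) * cycles

clock_pj = clock_energy_pj(levels=11, cycles=668)
total_pj = 1095.57 + 137.29 + clock_pj  # LUT outputs + inputs + clock
clock_share = clock_pj / total_pj
```

On these figures the clock accounts for more than 99.9% of the total, so LUT and input switching energy is negligible by comparison.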

Retrospect

• BOOM?
– Current code size ~1500 lines of Java
– Jeffrey Jacob’s RTL (2D) ~2600 lines
• Primary concern not with architecture but with backend tools
– PostScript output of ar?

Future Directions

• Reduced precision
• Cascade-LUTs
• More “spatially” suited implementations of DCT?
– Find one or make one up!
• IDCT