DCT HSRA Implementation Joseph Yeh December 3, 1998.

Outline

• Introduction / Overview
• Rationale / Prior Art
• Implementation Strategy
• Quality / Capacity
• SCORE
• Power

Overview

• Backbone of Image Compression Standards
  – JPEG, MPEG
  – still image compression of 8-10:1
• Similarity Transform on Image Data
  – B = C A C^T
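A minimal sketch of the transform B = C A C^T, assuming the orthonormal DCT-II normalization (the slides do not state which scaling is used, and JPEG codecs typically fold different scale factors into quantization). With this normalization C^T equals the inverse of C, which is why the operation is a similarity transform. The helper names `dct_matrix`, `matmul`, `transpose`, and `dct2d` are illustrative only.

```python
import math

def dct_matrix(n=8):
    """Build the n x n orthonormal DCT-II matrix C (assumed normalization)."""
    c = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        c.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                  for j in range(n)])
    return c

def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]

def transpose(x):
    return [list(row) for row in zip(*x)]

def dct2d(a):
    """Compute B = C A C^T for a square block A."""
    c = dct_matrix(len(a))
    return matmul(matmul(c, a), transpose(c))

# A constant 8x8 block has all of its energy in the DC coefficient B[0][0].
flat_block = [[1.0] * 8 for _ in range(8)]
b = dct2d(flat_block)
```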

Focus of Implementation

• 1D DCT
• Needs to be done 16 times on every 8x8 image block
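The count of 16 follows from the row-column decomposition: 8 one-dimensional DCTs over the rows, then 8 over the columns. The sketch below uses a direct O(n^2) 1D DCT with an assumed orthonormal scaling, not one of the fast algorithms discussed later; `dct_1d` and `dct_2d_row_column` are hypothetical names.

```python
import math

def dct_1d(x):
    """Direct (slow) orthonormal DCT-II of a sequence; scaling is an assumption."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[j] * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                               for j in range(n)))
    return out

def dct_2d_row_column(block):
    """2D DCT as a row pass followed by a column pass; also counts 1D DCT calls."""
    calls = 0
    row_pass = []
    for row in block:                      # 8 calls for an 8x8 block
        row_pass.append(dct_1d(row))
        calls += 1
    col_pass = []
    for col in zip(*row_pass):             # 8 more calls, one per column
        col_pass.append(dct_1d(list(col)))
        calls += 1
    result = [list(r) for r in zip(*col_pass)]  # transpose back to row-major
    return result, calls

coeffs, num_calls = dct_2d_row_column([[1.0] * 8 for _ in range(8)])
```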

Big Picture (JPEG)

Rationale

• Personal Background
• Versatile - Two Mediabench Apps
• Completely Feedforward
• Platform to Compare Wavelets With

Prior Art

• Common ASIC
• Jeffrey Jacob (Toronto) Master Thesis:
  – “Memory Interfacing for OneChip Reconfigurable Processor”
  – 2D DCT on Altera Flex10K50
• Xilinx 2D DCT implementation

Implementation

• Many fast algorithms exist
• Two provided in Mediabench JPEG
  – Loeffler, Ligtenberg, Moschytz
    • jfdctint.c
  – Arai, Agui, Nakajima
    • jfdctfst.c
• Both attempted

General Data Flow

Source: Jeffrey Jacob Master Thesis

General Data Flow

19 adds 5 mults 10 adds

Basic Strategy

• Whole design based on Ripple Adders
  – multipliers decomposed into Ripple Adders
  – simple to manipulate
  – placement not restricted by Cascades
  – only need to skew at input and output
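One way to read "multipliers decomposed into Ripple Adders" is shift-and-add multiplication by a fixed constant: each set bit of the constant contributes one shifted copy of the input, and the copies are summed with ripple-carry adders. The sketch below is a software model under that assumption, for unsigned values only; it is not taken from the actual design. The constant 181 is used only because it is roughly 0.7071 x 256.

```python
def ripple_add(a, b, width):
    """Bit-serial model of a ripple-carry adder on unsigned width-bit values.

    The carry out of the top bit is dropped, so the result is modulo 2**width.
    """
    carry = 0
    result = 0
    for i in range(width):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        result |= (abit ^ bbit ^ carry) << i
        carry = (abit & bbit) | (carry & (abit ^ bbit))
    return result

def const_multiply(x, constant, width):
    """Multiply x by a fixed constant using only shifts and ripple additions.

    Returns the product (modulo 2**width) and the number of shifted terms summed.
    """
    mask = (1 << width) - 1
    acc = 0
    shift = 0
    terms = 0
    while constant:
        if constant & 1:
            acc = ripple_add(acc, (x << shift) & mask, width)
            terms += 1
        constant >>= 1
        shift += 1
    return acc, terms

product, terms = const_multiply(37, 181, 16)  # 181 = 0b10110101, five set bits
```

Fewer set bits in a constant mean fewer adders, which is one reason reduced multiplier precision can save logic.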

Spatial Implementation

Vital Statistics

• Level 11 Array - 2048 BLB’s
• 8*9 = 72 Inputs
• 2*(12)+2*(13)+4*(14) = 106 Outputs
• 547 BLB’s of Logic
  – only 26% usage
• Latency: 156 Cycles
• 8-deep retiming will add 284 BLB’s
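The figures above can be cross-checked with a few lines of arithmetic. This sketch assumes a level-n array holds 2^n BLBs (consistent with 2048 at level 11) and reads the I/O counts as 8 inputs of 9 bits and 8 outputs of 12 to 14 bits; both readings are interpretations of the slide, not stated facts.

```python
levels = 11
array_blbs = 2 ** levels                  # assumed: 2**levels BLBs in the array

input_bits = 8 * 9                        # 8 samples, 9 bits each (interpretation)
output_bits = 2 * 12 + 2 * 13 + 4 * 14    # 8 coefficients of mixed widths

logic_blbs = 547
utilization = logic_blbs / array_blbs     # fraction of the array used by logic

retiming_blbs = 284
with_retiming = logic_blbs + retiming_blbs
utilization_with_retiming = with_retiming / array_blbs

print(f"{array_blbs} BLBs, {input_bits} in, {output_bits} out")
print(f"logic only: {utilization:.1%}, with retiming: {utilization_with_retiming:.1%}")
```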

Speedup

• Feed-forward implies a set of outputs every cycle
• “gcc -O3” compilation of jfdctfst.c
• on Sun Ultrasparc 10
• gives 0.2 seconds for 1048576 1D DCT’s
• 190 ns per DCT => Speedup ratio of 47

Speedup = (250 MHz)*(MP cycles per DCT)/(Rate of MP)
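The speedup figure can be reproduced from the numbers on this slide, assuming the array is clocked at 250 MHz and delivers one 1D DCT per cycle, as the feed-forward bullet implies. Variable names are illustrative.

```python
total_seconds = 0.2          # measured software run time
num_dcts = 1048576           # number of 1D DCTs in that run
hsra_clock_hz = 250e6        # assumed array clock rate

# Microprocessor time per 1D DCT, in nanoseconds.
mp_ns_per_dct = total_seconds / num_dcts * 1e9

# One result per cycle means the array needs one clock period per DCT.
hsra_ns_per_dct = 1e9 / hsra_clock_hz

speedup = mp_ns_per_dct / hsra_ns_per_dct

print(f"{mp_ns_per_dct:.1f} ns per DCT in software")
print(f"{hsra_ns_per_dct:.1f} ns per DCT on the array")
print(f"speedup of about {speedup:.1f}")
```

The ratio compares throughput only; the 156-cycle latency does not enter because the pipeline produces a new set of outputs every cycle once it is full.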

Quality/Capacity (1)

• Choice of AAN itself a Quality decision

• Estimate of 13 BLB’s saved per bit of precision in mults

• Increased PSNR probably a fluke, but reduced capacity designs definitely worth considering

Precision   BLB's   PSNR
Base        547     32.40
Base-2      521     32.55
Base-3      503     32.57
Base-4      490     19.40
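For reference, a sketch of the two calculations behind this table. PSNR is computed here with the standard definition for 8-bit samples, 10*log10(255^2 / MSE); the slides do not say exactly how the PSNR values were measured, so this is an assumption. The BLB estimate simply applies the "13 BLB's per bit" rule of thumb to the 547-BLB baseline.

```python
import math

def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf          # identical signals
    return 10 * math.log10(peak ** 2 / mse)

def estimated_blbs(bits_removed, base_blbs=547, blbs_per_bit=13):
    """Linear estimate of logic size after dropping multiplier precision bits."""
    return base_blbs - blbs_per_bit * bits_removed

estimates = [estimated_blbs(k) for k in (0, 2, 3, 4)]
example_psnr = psnr([0, 0, 0, 0], [1, 1, 1, 1])   # every sample off by one
```

The linear estimate gives 547, 521, 508, and 495 BLBs, against 547, 521, 503, and 490 in the table, so the rule of thumb is slightly conservative at three and four bits.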

Quality/Capacity (2)

Base-2 Precision

Quality/Capacity (3)

Base-4 Precision

SCORE

• Think in terms of whole JPEG application

• Reconsider block diagram in greater detail

SCORE: Big Detailed Picture

SCORE: Swapping Scheme

Power Statistics

• HSRA - 668 cycle run
  – Activity:
    • LUT Outputs: 0.288
    • System Inputs: 0.232
    • Total: 0.285

Power Statistics

• Ripple Adders
  – Most active LUTs in lower-order bits of adders, especially in input array
  – Higher order outputs flip sign frequently, causing bit toggling throughout two’s complement representation

Power Statistics

Power Statistics

• HSRA (600 cycle simulation):
  – Energy:
    • LUT Outputs: 1095.57 pJ
    • Inputs: 137.29 pJ
    • Clock Energy: 6412800 pJ
      » (300 pJ)(2^(levels-6))(cycles)
    • Total: 6414033 pJ
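The clock energy formula can be evaluated directly. In this sketch, `levels = 11` comes from the Vital Statistics slide, and the quoted 6412800 pJ is reproduced with 668 cycles, the run length given on the activity slide, rather than 600. The function name is illustrative.

```python
def clock_energy_pj(levels, cycles, base_pj=300):
    """Clock energy model from the slide: (300 pJ) * 2**(levels - 6) * cycles."""
    return base_pj * 2 ** (levels - 6) * cycles

lut_output_pj = 1095.57
input_pj = 137.29

clock_pj = clock_energy_pj(levels=11, cycles=668)
total_pj = lut_output_pj + input_pj + clock_pj
clock_share = clock_pj / total_pj

print(f"clock: {clock_pj} pJ, total: {total_pj:.2f} pJ")
print(f"clock share of total energy: {clock_share:.4%}")
```

Under this model the clock accounts for nearly all of the energy, with LUT outputs and inputs together contributing well under 0.1%.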

Retrospect

• BOOM?
  – Current code size ~ 1500 lines of Java
  – Jeffrey Jacob’s RTL (2D) ~ 2600 lines
• Primary concern not with architecture but with backend tools
  – postscript output of ar?

Future Directions

• Reduced precision
• Cascade-LUTs
• More “spatially” suited implementations of DCT?
  – Find one or make one up!
• IDCT