Outline
• Introduction / Overview
• Rationale / Prior Art
• Implementation Strategy
• Quality / Capacity
• SCORE
• Power
Overview
• Backbone of Image Compression Standards
• JPEG, MPEG
• Still-image compression of 8-10:1
• Similarity Transform on Image Data
• B = C A C^T
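As an illustration of the transform B = C A C^T, here is a minimal pure-Python sketch. It assumes C is the orthonormal 8x8 DCT-II matrix and A is an 8x8 block of image samples; the helper names (dct_matrix, dct2, idct2) are made up for this example and do not come from the Mediabench sources.

```python
import math

N = 8  # assumed block size (8x8, as in JPEG-style codecs)

def dct_matrix(n=N):
    """Orthonormal DCT-II matrix C, chosen so that C * C^T = I."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows

def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    return [list(col) for col in zip(*x)]

def dct2(a):
    """2D DCT of a square block: B = C A C^T."""
    c = dct_matrix(len(a))
    return matmul(matmul(c, a), transpose(c))

def idct2(b):
    """Inverse transform: A = C^T B C (valid because C is orthonormal)."""
    c = dct_matrix(len(b))
    return matmul(matmul(transpose(c), b), c)

# A flat block puts all of its energy into the DC coefficient B[0][0].
flat = [[100.0] * N for _ in range(N)]
coeffs = dct2(flat)
print(round(coeffs[0][0], 6))  # 800.0
```

Because C is orthonormal, the same matrix inverts the transform, which is why the operation can be read as a similarity transform on the image block.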
Rationale
• Personal Background
• Versatile - Two Mediabench Apps
• Completely Feedforward
• Platform to Compare Wavelets With
Prior Art
• Common ASIC
• Jeffrey Jacob (Toronto) Master's Thesis:
  – “Memory Interfacing for OneChip Reconfigurable Processor”
  – 2D DCT on Altera Flex10K50
• Xilinx 2D DCT implementation
Implementation
• Many fast algorithms exist
• Two provided in Mediabench JPEG
  – Loeffler, Ligtenberg, Moschytz
    • jfdctint.c
  – Arai, Agui, Nakajima
    • jfdctfst.c
• Both attempted
Basic Strategy
• Whole design based on Ripple Adders
  – multipliers decomposed into Ripple Adders
  – simple to manipulate
  – placement not restricted by Cascades
  – only need to skew at input and output
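One way to picture "multipliers decomposed into Ripple Adders" is a constant multiply built from shifted additions. The sketch below assumes a plain binary shift-and-add decomposition; the slides do not say which decomposition was actually used, and the constant and function names here are chosen only for illustration.

```python
def shift_add_terms(constant):
    """Bit positions that are set in a non-negative integer constant."""
    return [bit for bit in range(constant.bit_length())
            if (constant >> bit) & 1]

def multiply_by_constant(x, constant):
    """Multiply x by a constant using only shifts and additions.

    In hardware each addition would map to one ripple adder, so a
    constant with k set bits needs k - 1 adders in this scheme.
    """
    total = 0
    for shift in shift_add_terms(constant):
        total += x << shift
    return total

# Illustrative constant: 0.7071 (about cos(pi/4)) with 8 fractional bits.
FRAC_BITS = 8
C_FIX = round(0.7071 * (1 << FRAC_BITS))  # 181 = 0b10110101

print(shift_add_terms(C_FIX))            # [0, 2, 4, 5, 7]
print(multiply_by_constant(100, C_FIX))  # 18100
```

Dropping low-order bits of the constant removes terms from the sum, which is the kind of trade-off between precision and adder count that a reduced-capacity design would explore.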
Vital Statistics
• Level 11 Array - 2048 BLBs
• 8*9 = 72 Inputs
• 2*(12)+2*(13)+4*(14) = 106 Outputs
• 547 BLBs of Logic
  – only 26% usage
• Latency: 156 Cycles
• 8-deep retiming will add 284 BLBs
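The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes a level-n array holds 2^n BLBs and that the retiming BLBs simply add to the logic BLBs; the variable names are illustrative.

```python
LEVELS = 11
array_blbs = 2 ** LEVELS              # 2048 BLBs in a level 11 array

inputs = 8 * 9                        # 72 input bits
outputs = 2 * 12 + 2 * 13 + 4 * 14    # 106 output bits

logic_blbs = 547
retiming_blbs = 284                   # extra BLBs for 8-deep retiming

usage = logic_blbs / array_blbs
usage_retimed = (logic_blbs + retiming_blbs) / array_blbs

print(inputs, outputs)                # 72 106
print(f"{usage:.1%}")                 # 26.7%
print(f"{usage_retimed:.1%}")         # 40.6%
```

Even with retiming added, the design would occupy well under half of the array under these assumptions.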
Speedup
• Feed-forward implies a set of outputs every cycle
• “gcc -O3” compilation of jfdctfst.c
  – on Sun UltraSPARC 10
  – gives 0.2 seconds for 1048576 1D DCTs
  – 190 ns per DCT => Speedup ratio of 47
Speedup = (250 MHz)*(MP cycles per DCT)/(Rate of MP)
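The speedup figure can be reproduced from the numbers on this slide. The sketch assumes the array delivers one 1D DCT per cycle at 250 MHz, which is how the feed-forward claim is read here; the variable names are illustrative.

```python
HSRA_CLOCK_HZ = 250e6     # array clock rate from the slide
TOTAL_SECONDS = 0.2       # measured microprocessor run time
NUM_DCTS = 1_048_576      # number of 1D DCTs in that run

mp_seconds_per_dct = TOTAL_SECONDS / NUM_DCTS   # about 190.7 ns
hsra_seconds_per_dct = 1.0 / HSRA_CLOCK_HZ      # 4 ns, one result per cycle

speedup = mp_seconds_per_dct / hsra_seconds_per_dct

print(f"{mp_seconds_per_dct * 1e9:.1f} ns per DCT")  # 190.7 ns per DCT
print(f"speedup ~ {speedup:.1f}")                    # speedup ~ 47.7
```

Rounding the per-DCT time down to 190 ns gives 47.5, consistent with the ratio of 47 quoted on the slide.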
Quality/Capacity (1)
• Choice of AAN itself a Quality decision
• Estimate of 13 BLBs saved per bit of precision in multipliers
• Increased PSNR probably a fluke, but reduced capacity designs definitely worth considering
Precision   BLBs   PSNR
Base        547    32.40
Base-2      521    32.55
Base-3      503    32.57
Base-4      490    19.40
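For reference, PSNR is conventionally computed from the mean squared error between a reference image and the reconstructed one. The sketch below assumes 8-bit samples (peak value 255) and flat lists of pixels; the slides do not state which test image or peak value produced the numbers in the table.

```python
import math

def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of samples")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Tiny illustrative example: every sample is off by 2 grey levels.
original = [10, 50, 90, 130, 170, 210]
degraded = [p + 2 for p in original]
print(round(psnr(original, degraded), 2))  # 42.11
```

On this scale, the drop from about 32.5 to 19.40 at Base-4 is a large loss in quality compared with the small differences among the first three rows.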
Power Statistics
• HSRA - 668 cycle run
  – Activity:
    • LUT Outputs: 0.288
    • System Inputs: 0.232
    • Total: 0.285
Power Statistics
• Ripple Adders
  – Most active LUTs in less significant bits of adders - especially in input array
  – Higher order outputs flip sign frequently, causing bit toggling throughout two’s complement representation
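The effect of sign flips on switching activity can be seen by counting how many bits change between consecutive two's complement values. The sketch assumes a 14-bit output word (the widest output width listed under Vital Statistics); the function names are illustrative.

```python
def to_twos_complement(value, width):
    """Bit pattern of a signed integer in a fixed-width two's complement word."""
    return value & ((1 << width) - 1)

def toggled_bits(prev, curr, width):
    """Number of bit positions that differ between two consecutive values."""
    diff = to_twos_complement(prev, width) ^ to_twos_complement(curr, width)
    return bin(diff).count("1")

WIDTH = 14  # assumed output word width

# Small change without a sign flip: only low-order bits toggle.
print(toggled_bits(3, 5, WIDTH))    # 2

# Small change in magnitude with a sign flip: the sign extension toggles too.
print(toggled_bits(3, -5, WIDTH))   # 11
print(toggled_bits(1, -1, WIDTH))   # 13
```

A value that hovers around zero therefore toggles nearly every bit of the word on each sign change, even though its magnitude barely moves.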
Power Statistics
• HSRA (668 cycle simulation):
  – Energy:
    • LUT Outputs: 1095.57 pJ
    • Inputs: 137.29 pJ
    • Clock Energy: 6412800 pJ
      » (300 pJ)(2^(levels-6))(cycles)
    • Total: 6414033 pJ
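The energy totals can be checked against the clock-energy formula. The sketch assumes levels = 11 (the level 11 array from Vital Statistics) and cycles = 668 (the run length quoted on the first Power Statistics slide), which is the cycle count that reproduces the 6412800 pJ clock figure; the variable names are illustrative.

```python
LEVELS = 11    # level 11 array
CYCLES = 668   # run length that matches the quoted clock energy

lut_output_pj = 1095.57
input_pj = 137.29

# Clock energy: (300 pJ) * 2^(levels - 6) * cycles
clock_pj = 300 * 2 ** (LEVELS - 6) * CYCLES

total_pj = lut_output_pj + input_pj + clock_pj
clock_share = clock_pj / total_pj

print(clock_pj)                 # 6412800
print(round(total_pj))          # 6414033
print(f"{clock_share:.2%}")     # 99.98%
```

Under this model the clock accounts for almost all of the energy, so the LUT and input activity figures have very little effect on the total.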
Retrospect
• BOOM?
  – Current code size ~ 1500 lines of Java
  – Jeffrey Jacob's RTL (2D) ~ 2600 lines
• Primary concern not with architecture but with backend tools
  – postscript output of ar?