ECE 412: Microcomputer Laboratory
Fall 2006
Lecture 16: Accelerator Design on the XUP Board
Objectives
• Understand accelerator design considerations in a practical FPGA environment
• Learn the details of the XUP platform needed for efficient accelerator design
Four Fundamental Models of Accelerator Design
[Figure: four CPU/FPGA integration models, (a)–(d)]
(a) Base: no OS service; the application drives the FPGA directly (in simple embedded systems)
(b) OS service: the accelerator is exposed as a user-space mmap()'ed I/O device
(c) The accelerator is accessed through a device driver in the OS
(d) Virtualized device with OS scheduling support
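As a concrete illustration of model (b), here is a minimal user-space sketch. The device node /dev/accel, the 4 KB register window, and the register offsets are hypothetical placeholders, not XUP specifics.

/* Minimal sketch of model (b): driving an accelerator as a user-space
 * mmap()'ed I/O device. Device node, window size, and register layout
 * are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/accel", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the accelerator's register window into user address space. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    regs[0] = 0x1;                  /* hypothetical "start" register  */
    while ((regs[1] & 0x1) == 0)    /* hypothetical "done" status bit */
        ;                           /* busy-wait; no OS scheduling    */
    printf("result = 0x%08x\n", regs[2]);

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}

Note that in this model the OS is involved only in the initial open/mmap; every subsequent access bypasses the kernel entirely, which is fast but offers no scheduling or protection beyond the page mapping.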
Hybrid Hardware/Software Execution Model
[Figure: hybrid hardware/software execution model. Compile time: source code passes through compiler analysis/transformations to produce CPU code and soft objects (DLLs), and through synthesis to produce hard objects (FPGA bitmaps); human-designed hardware enters the flow as a user-level function or device driver. User runtime: the linker/loader binds the application to its DLLs. Kernel runtime: OS modules and a resource manager in the Linux OS manage the CPU, FPGA accelerators, memory, and devices.]
• Hardware accelerator as a DLL
  – Seamless integration of hardware accelerators into the Linux software stack for use by mainstream applications
  – The DLL approach enables transparent interchange of software and hardware components (see the sketch below)
• Application-level execution model
  – Deep compiler analysis and transformations generate CPU code, hardware library stubs, and synthesized components
  – FPGA bitmaps are the hardware counterpart to existing software modules
  – The same dynamic-linking interfaces and stubs apply to both software and hardware implementations
• OS resource management
  – Services (API) for allocation, partial reconfiguration, saving and restoring status, and monitoring
  – The multiprogramming scheduler can pre-fetch hardware accelerators in time for the next use
  – Access to the new hardware is controlled to ensure trust under private or shared use
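The following sketch shows how the dynamic linker can interchange software and hardware implementations behind one interface. The library names and the audio_linear_dither() signature are hypothetical stand-ins, not the actual madplay interfaces.

/* Sketch of transparent SW/HW interchange through the dynamic linker.
 * Library names and function signature are hypothetical. */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*dither_fn)(int sample);

int main(void)
{
    /* Prefer the hardware DLL if present; fall back to software.
     * A real loader could make this choice per call site, invisibly
     * to the application. */
    void *lib = dlopen("libdither_hw.so", RTLD_NOW);
    if (!lib)
        lib = dlopen("libdither_sw.so", RTLD_NOW);
    if (!lib) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    dither_fn dither = (dither_fn)dlsym(lib, "audio_linear_dither");
    if (!dither) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    printf("dithered: %d\n", dither(12345));
    dlclose(lib);
    return 0;
}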
MP3 Decoder: Madplay Lib. Dithering as DLL
• Madplay shared-library dithering function provided as both a software DLL and an FPGA DLL (see the sketch after the figure)
  – Software profiling shows audio_linear_dither() takes 97% of application time
  – The DL (dynamic linker) can switch the call to the hardware or the software implementation
• The library is used by ~100 video and audio applications
[Figure: dithering as a DLL. In the decode loop (Decode MP3 Block → Read Sample → DL → Write Sample), the application calls through a stub and the DL dispatches to either the Software Dithering DLL or the Hardware Dithering DLL; samples then flow through the OS sound driver to the AC'97 codec. Both implementations perform quantization, clipping, dithering with a random generator, biasing, and noise shaping; the hardware dithering pipeline on the FPGA takes 6 cycles.]
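To make the accelerated stages concrete, here is a minimal software sketch of the dithering steps named in the figure. Constants, word widths, and the stage ordering are illustrative assumptions, not madplay's actual implementation.

/* Minimal sketch of the dithering stages: noise shaping, biasing,
 * random dither, quantization, clipping. Illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int32_t shape_error;   /* simple first-order noise shaping state */

int16_t linear_dither(int32_t sample)   /* 32-bit in, 16-bit out */
{
    sample += shape_error;                        /* noise shaping      */
    sample += 1 << 14;                            /* biasing (rounding) */
    sample += (rand() & 0x7FFF) - (1 << 14);      /* random dither      */

    int32_t quant = sample >> 15;                 /* quantize 32->16    */
    if (quant >  32767) quant =  32767;           /* clipping           */
    if (quant < -32768) quant = -32768;

    shape_error = sample - (quant << 15);         /* feed error forward */
    return (int16_t)quant;
}

int main(void)
{
    /* Feed a few raw samples (15 fractional bits) through the chain. */
    int32_t raw[] = { 5 << 15, -(42 << 15), 1234567 };
    for (int i = 0; i < 3; i++)
        printf("%d -> %d\n", raw[i], linear_dither(raw[i]));
    return 0;
}

Because each output depends only on one sample plus a small amount of carried state, the whole chain maps naturally onto a short FPGA pipeline, which is how the hardware version reaches 6 cycles.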
CPU-Accelerator Interconnect Options
[Figure: Virtex-II Pro system. The PowerPC can reach the Motion Estimation Accelerator and BRAM over four candidate paths (1–4): the PLB, the OCM bus, the DCR bus, and the DDR controller to external DDR RAM.]

Bus latency (cycles):

            First Access      Pipelined
  Bus       Read   Write     Read  Write    Arbitration
  PLB        21     20         3     3          15
  OCM         4      3         3     2           -
  DCR         3      3         2     3           -
• PLB (Processor Local Bus)
  – Wide transfers: 64 bits
  – Direct access to the DRAM channel
  – Runs at 1/3 of the CPU frequency
  – Big penalty if the bus is busy on the first attempt to access it
• OCM (On-Chip Memory) interconnect
  – Narrower: 32 bits
  – No direct access to the DRAM channel
  – Runs at the CPU clock frequency (compared quantitatively in the sketch below)
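A back-of-the-envelope comparison using the table above. We assume a 300 MHz CPU clock with the PLB at 1/3 of that, and that the table's PLB entries count bus cycles; both are our assumptions, not measured figures.

/* PLB vs. OCM read cost for an N-word burst, from the latency table. */
#include <stdio.h>

int main(void)
{
    int words = 64;   /* example burst: 64 32-bit words */

    /* PLB: 64 bits wide -> two words per beat; one bus cycle is
     * 3 CPU cycles. First beat 21 bus cycles, pipelined beats 3. */
    int plb_bus_cycles = 21 + 3 * (words / 2 - 1);
    int plb_cpu_cycles = 3 * plb_bus_cycles;

    /* OCM: 32 bits wide at CPU frequency.
     * First word 4 cycles, pipelined words 3. */
    int ocm_cpu_cycles = 4 + 3 * (words - 1);

    printf("PLB: %d CPU cycles, OCM: %d CPU cycles\n",
           plb_cpu_cycles, ocm_cpu_cycles);   /* 342 vs. 193 */
    return 0;
}

Under these assumptions the OCM path wins on raw cycle count despite its narrower width; the PLB's real advantage is its direct path to the DRAM channel.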
Motion Estimation Design & Experience
[Figure: the same Virtex-II Pro system as above, with the Motion Estimation Accelerator attached via candidate paths 1–4.]
• Significant overhead in the open and mmap calls
  – This arrangement can only support accelerators that will be invoked many times (see the break-even sketch after the table)
  – Note the dramatic reduction in computation time
  – Note the large overhead in data marshaling and write
• Full Search gives 10% better compression
  – Diamond Search is sequential, not suitable for acceleration
Per-action costs (cycles ÷ 300 = μs, i.e., a 300 MHz CPU clock):

  Accelerator Action           cycles    μs
  open System Call*            34,000    113.3
  mmap System Call*            90,600    302
  Data Marshaling + Write*     24,000     80.0
  Initiation*                     180      0.600
  Accelerator Computation       5,400     18
  Read From Accelerator*          300      1

  (* = overhead)

Total Time per Call:

                               cycles    μs
  Full Search (HW)             96,480    322
  Diamond Search (SW)         174,911    583
  Full Search (SW)            639,925  2,133
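How many invocations before the accelerator pays for itself? The sketch below uses the totals from the table; treating open and mmap as one-time setup costs amortized across calls is our reading of the slide, not a stated result.

/* Break-even sketch: one-time open+mmap cost vs. per-call totals. */
#include <stdio.h>

int main(void)
{
    long setup   = 34000 + 90600;  /* one-time: open + mmap         */
    long hw_call = 96480;          /* per call: Full Search (HW)    */
    long sw_full = 639925;         /* per call: Full Search (SW)    */
    long sw_diam = 174911;         /* per call: Diamond Search (SW) */

    for (int n = 1; n <= 4; n++)
        printf("n=%d: HW=%ld  SW-full=%ld  SW-diamond=%ld\n",
               n, setup + n * hw_call, n * sw_full, n * sw_diam);
    return 0;
}

Under this reading, the accelerated Full Search beats even the much cheaper Diamond Search after only two invocations, which is why amortizing the open/mmap overhead across many calls is essential.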
JPEG: An Example
[Figure: JPEG compression pipeline. Original Image (RGB) → RGB-to-YUV conversion → per-channel downsampling (Y, U, V) → 2D Discrete Cosine Transform (DCT) → Quantization (QUANT) → Run-Length Encoding (RLE) → Huffman Coding (HC) → Compressed Image. DCT and quantization execute in parallel on independent blocks and are the accelerator candidates implemented as reconfigurable logic; RLE and Huffman coding form an inherently sequential region.]
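The structural reason DCT and Quant are the accelerator candidates is visible in code form below. Function names are illustrative and the bodies are stubs; the point is which loop carries state.

/* Why DCT+Quant parallelize and RLE/Huffman do not: each 8x8 block is
 * independent, but the entropy coder carries state across blocks. */
#include <stddef.h>

static void dct_8x8(short blk[64])  { (void)blk; /* stub: 2D DCT      */ }
static void quantize(short blk[64]) { (void)blk; /* stub: quant table */ }
static void rle_huffman(const short blk[64], int *dc_pred)
{
    /* Stub: DC prediction and bit packing flow block to block,
     * which is what makes this region inherently sequential. */
    *dc_pred = blk[0];
}

void compress_blocks(short (*blocks)[64], size_t n)
{
    /* Phase 1, independent blocks: each iteration touches only its own
     * block, so these could run on parallel FPGA DCT->Quant pipelines. */
    for (size_t i = 0; i < n; i++) {
        dct_8x8(blocks[i]);
        quantize(blocks[i]);
    }

    /* Phase 2, inherently sequential: coder state forces in-order work. */
    int dc_pred = 0;
    for (size_t i = 0; i < n; i++)
        rle_huffman(blocks[i], &dc_pred);
}

int main(void)
{
    static short blocks[4][64];
    compress_blocks(blocks, 4);
    return 0;
}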
JPEG Accelerator Design & Experience
[Figure: JPEG accelerator organization on the Virtex-II Pro. The PowerPC issues control over the DCRs and data over the PLB to a Switchable Accelerator Interconnect Framework; the framework control steers data from an input buffer through the DCT and Quant units to an output buffer, and a DMA controller moves data through the DDR controller to external DDR RAM.]
• Based on Model (d)
  – System call overhead for each invocation (user side sketched after the table)
  – Better protection
• DCT and Quant are accelerated
  – Data flows directly from DCT to Quant
• Data copy to the user DMA buffer dominates cost
Per-call cost breakdown:

  Action                   cycles    μs
  System Call Overhead      1,853    6.18
  DMA Setup                   549    1.83
  DMA Transfers               448    1.49
  Accelerator Execution       987    3.29
  Cache Coherence             348    1.16
  Data Copies               1,060    3.53
  Total Time                5,244   17.5

  (Every row except Accelerator Execution is overhead: only 987 of 5,244 cycles, about 19%, go to actual computation.)
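A user-side sketch of the Model (d) invocation pattern: one open(), then a kernel entry per batch of macroblocks. The device node /dev/accel, the ioctl command, and the request layout are hypothetical.

/* User-side sketch of Model (d). All names are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define ACCEL_DCT_QUANT 0x4001   /* hypothetical ioctl command */

struct accel_req {               /* hypothetical request block */
    short (*blocks)[64];
    int    num_blocks;
};

int main(void)
{
    int fd = open("/dev/accel", O_RDWR);   /* once per process */
    if (fd < 0) { perror("open"); return 1; }

    static short blocks[16][64];           /* macroblock data */
    struct accel_req req = { blocks, 16 };

    /* Per invocation: the kernel copies and DMA-transfers the blocks,
     * runs DCT+Quant, and returns transformed data in place. This is
     * the system-call overhead in the table, paid on every call. */
    if (ioctl(fd, ACCEL_DCT_QUANT, &req) < 0)
        perror("ioctl");

    close(fd);
    return 0;
}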
Execution Flow of DCT System Call
Application code (time flows downward):

    open("/dev/accel");    /* only once */
    ...
    /* construct macroblocks */
    macroblock = ...;
    syscall(&macroblock, num_blocks);
    ...
    /* macroblock now has transformed data */
    ...

Operating system and hardware activity during the call (data and DMA-setup steps cross the PLB between the PPC, memory, and the DMA controller; the accelerator is controlled over the DCR):
  1. Data copy: application buffer → DMA buffer (PPC ↔ memory)
  2. Flush cache range (PPC ↔ memory)
  3. Set up DMA transfer to the accelerator (PPC ↔ DMA controller)
  4. Enable accelerator access for the application (DCR); the PPC polls while the accelerator executes
  5. Set up DMA transfer back (PPC ↔ DMA controller)
  6. Invalidate cache range (PPC ↔ memory)
  7. Data copy: DMA buffer → application buffer (PPC ↔ memory)
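The kernel side of this flow, written as Linux-driver-style pseudocode. copy_from_user/copy_to_user, dma_map_single/dma_unmap_single, and cpu_relax() are real kernel APIs corresponding to the steps above; accel_dev, accel_buf, accel_dma_start(), and accel_done() are hypothetical stand-ins for this driver's internals.

/* Kernel-side sketch of the per-call flow above. */
#include <linux/uaccess.h>      /* copy_from_user, copy_to_user */
#include <linux/dma-mapping.h>  /* dma_map_single, dma_unmap_single */

extern struct device *accel_dev;                    /* hypothetical */
extern void *accel_buf;                             /* hypothetical */
void accel_dma_start(dma_addr_t addr, size_t len);  /* hypothetical */
int  accel_done(void);                              /* hypothetical */

long accel_dct_call(void __user *ubuf, size_t len)
{
    dma_addr_t handle;

    /* Step 1: data copy, application buffer -> kernel DMA buffer. */
    if (copy_from_user(accel_buf, ubuf, len))
        return -EFAULT;

    /* Steps 2-3: flush the cache range and map the buffer for DMA,
     * then program the DMA controller to feed the accelerator. */
    handle = dma_map_single(accel_dev, accel_buf, len, DMA_BIDIRECTIONAL);
    accel_dma_start(handle, len);

    /* Step 4: poll while the accelerator executes. */
    while (!accel_done())
        cpu_relax();

    /* Steps 5-6: tear down the mapping, invalidating the cache range
     * so the CPU sees the DMA'd results. */
    dma_unmap_single(accel_dev, handle, len, DMA_BIDIRECTIONAL);

    /* Step 7: data copy, kernel DMA buffer -> application buffer. */
    if (copy_to_user(ubuf, accel_buf, len))
        return -EFAULT;
    return 0;
}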
Software Versus Hardware Acceleration
Overhead is a major issue!
Device Driver Access Cost