ECE 412: Microcomputer Laboratory
Fall 2006
Lecture 16: Accelerator Design on the XUP Board
Objectives
• Understand accelerator design considerations in a practical FPGA environment
• Learn the details of the XUP platform needed for efficient accelerator design
Four Fundamental Models of Accelerator Design
[Figure: four CPU/FPGA integration models, (a)–(d)]
(a) Base: no OS service; the application drives the FPGA directly (in simple embedded systems)
(b) OS service: the accelerator is exposed as a user-space mmap()'ed I/O device
(c) The accelerator is accessed through a device driver in the OS
(d) Virtualized device with OS scheduling support
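As a concrete illustration of model (b), here is a minimal user-space sketch. The device node /dev/accel, the 4 KB register window, and the register offsets are hypothetical placeholders, not XUP specifics.

/* Minimal sketch of model (b): driving an accelerator as a user-space
 * mmap()'ed I/O device. Device node, window size, and register layout
 * are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/accel", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the accelerator's register window into user address space. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    regs[0] = 0x1;                  /* hypothetical "start" register  */
    while ((regs[1] & 0x1) == 0)    /* hypothetical "done" status bit */
        ;                           /* busy-wait; no OS scheduling    */
    printf("result = 0x%08x\n", regs[2]);

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}

Note that in this model the OS is involved only in the initial open/mmap; every subsequent access bypasses the kernel entirely, which is fast but offers no scheduling or protection beyond the page mapping.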
Hybrid Hardware/Software Execution Model
[Figure: hybrid hardware/software execution model. Compile time: source code passes through compiler analysis/transformations to produce CPU code and soft objects (DLLs), and through synthesis to produce hard objects (FPGA bitmaps); human-designed hardware enters the flow as a user-level function or device driver. User runtime: the linker/loader binds the application to its DLLs. Kernel runtime: OS modules and a resource manager in the Linux OS manage the CPU, FPGA accelerators, memory, and devices.]
• Hardware accelerator as a DLL
  – Seamless integration of hardware accelerators into the Linux software stack for use by mainstream applications
  – The DLL approach enables transparent interchange of software and hardware components (see the sketch below)
• Application-level execution model
  – Deep compiler analysis and transformations generate CPU code, hardware library stubs, and synthesized components
  – FPGA bitmaps are the hardware counterpart to existing software modules
  – The same dynamic-linking interfaces and stubs apply to both software and hardware implementations
• OS resource management
  – Services (API) for allocation, partial reconfiguration, saving and restoring status, and monitoring
  – The multiprogramming scheduler can pre-fetch hardware accelerators in time for the next use
  – Access to the new hardware is controlled to ensure trust under private or shared use
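The following sketch shows how the dynamic linker can interchange software and hardware implementations behind one interface. The library names and the audio_linear_dither() signature are hypothetical stand-ins, not the actual madplay interfaces.

/* Sketch of transparent SW/HW interchange through the dynamic linker.
 * Library names and function signature are hypothetical. */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*dither_fn)(int sample);

int main(void)
{
    /* Prefer the hardware DLL if present; fall back to software.
     * A real loader could make this choice per call site, invisibly
     * to the application. */
    void *lib = dlopen("libdither_hw.so", RTLD_NOW);
    if (!lib)
        lib = dlopen("libdither_sw.so", RTLD_NOW);
    if (!lib) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    dither_fn dither = (dither_fn)dlsym(lib, "audio_linear_dither");
    if (!dither) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    printf("dithered: %d\n", dither(12345));
    dlclose(lib);
    return 0;
}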
MP3 Decoder: Madplay Lib. Dithering as DLL
• Madplay shared-library dithering function provided as both a software DLL and an FPGA DLL (see the sketch after the figure)
  – Software profiling shows audio_linear_dither() takes 97% of application time
  – The DL (dynamic linker) can switch the call to the hardware or the software implementation
• The library is used by ~100 video and audio applications
[Figure: dithering as a DLL. In the decode loop (Decode MP3 Block → Read Sample → DL → Write Sample), the application calls through a stub and the DL dispatches to either the Software Dithering DLL or the Hardware Dithering DLL; samples then flow through the OS sound driver to the AC'97 codec. Both implementations perform quantization, clipping, dithering with a random generator, biasing, and noise shaping; the hardware dithering pipeline on the FPGA takes 6 cycles.]
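To make the accelerated stages concrete, here is a minimal software sketch of the dithering steps named in the figure. Constants, word widths, and the stage ordering are illustrative assumptions, not madplay's actual implementation.

/* Minimal sketch of the dithering stages: noise shaping, biasing,
 * random dither, quantization, clipping. Illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int32_t shape_error;   /* simple first-order noise shaping state */

int16_t linear_dither(int32_t sample)   /* 32-bit in, 16-bit out */
{
    sample += shape_error;                        /* noise shaping      */
    sample += 1 << 14;                            /* biasing (rounding) */
    sample += (rand() & 0x7FFF) - (1 << 14);      /* random dither      */

    int32_t quant = sample >> 15;                 /* quantize 32->16    */
    if (quant >  32767) quant =  32767;           /* clipping           */
    if (quant < -32768) quant = -32768;

    shape_error = sample - (quant << 15);         /* feed error forward */
    return (int16_t)quant;
}

int main(void)
{
    /* Feed a few raw samples (15 fractional bits) through the chain. */
    int32_t raw[] = { 5 << 15, -(42 << 15), 1234567 };
    for (int i = 0; i < 3; i++)
        printf("%d -> %d\n", raw[i], linear_dither(raw[i]));
    return 0;
}

Because each output depends only on one sample plus a small amount of carried state, the whole chain maps naturally onto a short FPGA pipeline, which is how the hardware version reaches 6 cycles.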
CPU-Accelerator Interconnect Options
[Figure: Virtex-II Pro system. The PowerPC can reach the Motion Estimation Accelerator and BRAM over four candidate paths (1–4): the PLB, the OCM bus, the DCR bus, and the DDR controller to external DDR RAM.]

Bus latency (cycles):

            First Access      Pipelined
  Bus       Read   Write     Read  Write    Arbitration
  PLB        21     20         3     3          15
  OCM         4      3         3     2           -
  DCR         3      3         2     3           -
• PLB (Processor Local Bus)
  – Wide transfers: 64 bits
  – Direct access to the DRAM channel
  – Runs at 1/3 of the CPU frequency
  – Big penalty if the bus is busy on the first attempt to access it
• OCM (On-Chip Memory) interconnect
  – Narrower: 32 bits
  – No direct access to the DRAM channel
  – Runs at the CPU clock frequency (compared quantitatively in the sketch below)
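A back-of-the-envelope comparison using the table above. We assume a 300 MHz CPU clock with the PLB at 1/3 of that, and that the table's PLB entries count bus cycles; both are our assumptions, not measured figures.

/* PLB vs. OCM read cost for an N-word burst, from the latency table. */
#include <stdio.h>

int main(void)
{
    int words = 64;   /* example burst: 64 32-bit words */

    /* PLB: 64 bits wide -> two words per beat; one bus cycle is
     * 3 CPU cycles. First beat 21 bus cycles, pipelined beats 3. */
    int plb_bus_cycles = 21 + 3 * (words / 2 - 1);
    int plb_cpu_cycles = 3 * plb_bus_cycles;

    /* OCM: 32 bits wide at CPU frequency.
     * First word 4 cycles, pipelined words 3. */
    int ocm_cpu_cycles = 4 + 3 * (words - 1);

    printf("PLB: %d CPU cycles, OCM: %d CPU cycles\n",
           plb_cpu_cycles, ocm_cpu_cycles);   /* 342 vs. 193 */
    return 0;
}

Under these assumptions the OCM path wins on raw cycle count despite its narrower width; the PLB's real advantage is its direct path to the DRAM channel.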
Motion Estimation Design & Experience
[Figure: the same Virtex-II Pro system as above, with the Motion Estimation Accelerator attached via candidate paths 1–4.]
• Significant overhead in the open and mmap calls
  – This arrangement can only support accelerators that will be invoked many times (see the break-even sketch after the table)
  – Note the dramatic reduction in computation time
  – Note the large overhead in data marshaling and write
• Full Search gives 10% better compression
  – Diamond Search is sequential, not suitable for acceleration
Per-action costs (cycles ÷ 300 = μs, i.e., a 300 MHz CPU clock):

  Accelerator Action           cycles    μs
  open System Call*            34,000    113.3
  mmap System Call*            90,600    302
  Data Marshaling + Write*     24,000     80.0
  Initiation*                     180      0.600
  Accelerator Computation       5,400     18
  Read From Accelerator*          300      1

  (* = overhead)

Total Time per Call:

                               cycles    μs
  Full Search (HW)             96,480    322
  Diamond Search (SW)         174,911    583
  Full Search (SW)            639,925  2,133
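How many invocations before the accelerator pays for itself? The sketch below uses the totals from the table; treating open and mmap as one-time setup costs amortized across calls is our reading of the slide, not a stated result.

/* Break-even sketch: one-time open+mmap cost vs. per-call totals. */
#include <stdio.h>

int main(void)
{
    long setup   = 34000 + 90600;  /* one-time: open + mmap         */
    long hw_call = 96480;          /* per call: Full Search (HW)    */
    long sw_full = 639925;         /* per call: Full Search (SW)    */
    long sw_diam = 174911;         /* per call: Diamond Search (SW) */

    for (int n = 1; n <= 4; n++)
        printf("n=%d: HW=%ld  SW-full=%ld  SW-diamond=%ld\n",
               n, setup + n * hw_call, n * sw_full, n * sw_diam);
    return 0;
}

Under this reading, the accelerated Full Search beats even the much cheaper Diamond Search after only two invocations, which is why amortizing the open/mmap overhead across many calls is essential.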
JPEG: An Example
[Figure: JPEG compression pipeline. Original Image (RGB) → RGB-to-YUV conversion → per-channel downsampling (Y, U, V) → 2D Discrete Cosine Transform (DCT) → Quantization (QUANT) → Run-Length Encoding (RLE) → Huffman Coding (HC) → Compressed Image. DCT and quantization execute in parallel on independent blocks and are the accelerator candidates implemented as reconfigurable logic; RLE and Huffman coding form an inherently sequential region.]
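The structural reason DCT and Quant are the accelerator candidates is visible in code form below. Function names are illustrative and the bodies are stubs; the point is which loop carries state.

/* Why DCT+Quant parallelize and RLE/Huffman do not: each 8x8 block is
 * independent, but the entropy coder carries state across blocks. */
#include <stddef.h>

static void dct_8x8(short blk[64])  { (void)blk; /* stub: 2D DCT      */ }
static void quantize(short blk[64]) { (void)blk; /* stub: quant table */ }
static void rle_huffman(const short blk[64], int *dc_pred)
{
    /* Stub: DC prediction and bit packing flow block to block,
     * which is what makes this region inherently sequential. */
    *dc_pred = blk[0];
}

void compress_blocks(short (*blocks)[64], size_t n)
{
    /* Phase 1, independent blocks: each iteration touches only its own
     * block, so these could run on parallel FPGA DCT->Quant pipelines. */
    for (size_t i = 0; i < n; i++) {
        dct_8x8(blocks[i]);
        quantize(blocks[i]);
    }

    /* Phase 2, inherently sequential: coder state forces in-order work. */
    int dc_pred = 0;
    for (size_t i = 0; i < n; i++)
        rle_huffman(blocks[i], &dc_pred);
}

int main(void)
{
    static short blocks[4][64];
    compress_blocks(blocks, 4);
    return 0;
}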
JPEG Accelerator Design & Experience
[Figure: JPEG accelerator organization on the Virtex-II Pro. The PowerPC issues control over the DCRs and data over the PLB to a Switchable Accelerator Interconnect Framework; the framework control steers data from an input buffer through the DCT and Quant units to an output buffer, and a DMA controller moves data through the DDR controller to external DDR RAM.]
• Based on Model (d)
  – System call overhead for each invocation (user side sketched after the table)
  – Better protection
• DCT and Quant are accelerated
  – Data flows directly from DCT to Quant
• Data copy to the user DMA buffer dominates cost
Per-call cost breakdown:

  Action                   cycles    μs
  System Call Overhead      1,853    6.18
  DMA Setup                   549    1.83
  DMA Transfers               448    1.49
  Accelerator Execution       987    3.29
  Cache Coherence             348    1.16
  Data Copies               1,060    3.53
  Total Time                5,244   17.5

  (Every row except Accelerator Execution is overhead: only 987 of 5,244 cycles, about 19%, go to actual computation.)
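A user-side sketch of the Model (d) invocation pattern: one open(), then a kernel entry per batch of macroblocks. The device node /dev/accel, the ioctl command, and the request layout are hypothetical.

/* User-side sketch of Model (d). All names are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define ACCEL_DCT_QUANT 0x4001   /* hypothetical ioctl command */

struct accel_req {               /* hypothetical request block */
    short (*blocks)[64];
    int    num_blocks;
};

int main(void)
{
    int fd = open("/dev/accel", O_RDWR);   /* once per process */
    if (fd < 0) { perror("open"); return 1; }

    static short blocks[16][64];           /* macroblock data */
    struct accel_req req = { blocks, 16 };

    /* Per invocation: the kernel copies and DMA-transfers the blocks,
     * runs DCT+Quant, and returns transformed data in place. This is
     * the system-call overhead in the table, paid on every call. */
    if (ioctl(fd, ACCEL_DCT_QUANT, &req) < 0)
        perror("ioctl");

    close(fd);
    return 0;
}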
Execution Flow of DCT System Call
Application code (time flows downward):

    open("/dev/accel");    /* only once */
    ...
    /* construct macroblocks */
    macroblock = ...;
    syscall(&macroblock, num_blocks);
    ...
    /* macroblock now has transformed data */
    ...

Operating system and hardware activity during the call (data and DMA-setup steps cross the PLB between the PPC, memory, and the DMA controller; the accelerator is controlled over the DCR):
  1. Data copy: application buffer → DMA buffer (PPC ↔ memory)
  2. Flush cache range (PPC ↔ memory)
  3. Set up DMA transfer to the accelerator (PPC ↔ DMA controller)
  4. Enable accelerator access for the application (DCR); the PPC polls while the accelerator executes
  5. Set up DMA transfer back (PPC ↔ DMA controller)
  6. Invalidate cache range (PPC ↔ memory)
  7. Data copy: DMA buffer → application buffer (PPC ↔ memory)
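The kernel side of this flow, written as Linux-driver-style pseudocode. copy_from_user/copy_to_user, dma_map_single/dma_unmap_single, and cpu_relax() are real kernel APIs corresponding to the steps above; accel_dev, accel_buf, accel_dma_start(), and accel_done() are hypothetical stand-ins for this driver's internals.

/* Kernel-side sketch of the per-call flow above. */
#include <linux/uaccess.h>      /* copy_from_user, copy_to_user */
#include <linux/dma-mapping.h>  /* dma_map_single, dma_unmap_single */

extern struct device *accel_dev;                    /* hypothetical */
extern void *accel_buf;                             /* hypothetical */
void accel_dma_start(dma_addr_t addr, size_t len);  /* hypothetical */
int  accel_done(void);                              /* hypothetical */

long accel_dct_call(void __user *ubuf, size_t len)
{
    dma_addr_t handle;

    /* Step 1: data copy, application buffer -> kernel DMA buffer. */
    if (copy_from_user(accel_buf, ubuf, len))
        return -EFAULT;

    /* Steps 2-3: flush the cache range and map the buffer for DMA,
     * then program the DMA controller to feed the accelerator. */
    handle = dma_map_single(accel_dev, accel_buf, len, DMA_BIDIRECTIONAL);
    accel_dma_start(handle, len);

    /* Step 4: poll while the accelerator executes. */
    while (!accel_done())
        cpu_relax();

    /* Steps 5-6: tear down the mapping, invalidating the cache range
     * so the CPU sees the DMA'd results. */
    dma_unmap_single(accel_dev, handle, len, DMA_BIDIRECTIONAL);

    /* Step 7: data copy, kernel DMA buffer -> application buffer. */
    if (copy_to_user(ubuf, accel_buf, len))
        return -EFAULT;
    return 0;
}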
Software Versus Hardware Acceleration
Overhead is a major issue!
Device Driver Access Cost