Vector Processing as a Soft-core CPU Accelerator Jason Yu, Guy Lemieux, Chris Eagleston {jasony, lemieux, ceaglest}@ece.ubc.ca University of British Columbia.


Prepared for FPGA2008, Altera, and Xilinx
February 26-28, 2008

Motivation

FPGAs for embedded processing
  High performance, computationally intensive
  Growing use of embedded processors on FPGAs
  Nios/MicroBlaze too slow

Faster performance
  Faster Nios/MicroBlaze
  Multiprocessor-on-FPGA
  Custom hardware accelerator
  Synthesized accelerator

Problems…

Faster Nios/MicroBlaze: not feasible
  2- or 4-way superscalar/VLIW register file maps inefficiently to FPGA
  Superscalar: complex dependency checking

Multiprocessor-on-FPGA: complexity
  Parallel programming and debugging
  System design
  Cache coherence, memory consistency

Custom hardware accelerator: cost
  Need hardware engineer
  Time-consuming to design and debug
  1 hardware accelerator per function

Possible Solutions…

Automatically synthesized hardware accelerators
  Change software -> regenerate and recompile RTL
  Altera C2H
  Xilinx CHiMPS
  Mitrion Virtual Processor
  CriticalBlue Cascade

Soft vector processor
  Change software -> same RTL, just recompile software
  Purely software-based
  Decouples hardware/software development teams

Advantages of Vector Processing

Simple programming model
  Short to long vector data parallelism
  Regular, easy to accelerate

Purely software-based
  One hardware accelerator supports many applications

Scalable performance and area

Contributions

Configurable soft vector processor
  Selectable performance/resource tradeoff
  Area customization

FPGA-specific enhancements
  Partitioned register file
  Vector reductions using MAC chain
  Local vector datapath memory

Overview of Vector Processing

Acceleration with Vector Processing

Organize data as long vectors
  Data-level parallelism

Vector instruction execution
  Multiple vector lanes (SIMD)
  Repeated SIMD operation over length of vector

[Figure labels: source vector registers; vector lanes; destination vector register]

for (i=0; i<NELEM; i++) a[i] = b[i] * c[i];

vmult a, b, c
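
The loop and the single vmult instruction above can be related with a minimal C sketch. It models one vector instruction as repeated SIMD passes over NLANE lanes until all VL elements are done. The names vmult_model and NLANE, and the returned pass count, are illustrative assumptions and not the processor's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NLANE 4  /* assumed number of vector lanes (illustrative) */

/* Models "vmult a, b, c" for a vector of length vl.
 * Returns the number of lane passes, i.e. ceil(vl / NLANE). */
size_t vmult_model(int32_t *a, const int32_t *b, const int32_t *c, size_t vl)
{
    size_t passes = 0;
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t base = 0; base < vl; base += NLANE) {
        /* In hardware these up-to-NLANE multiplies would run in parallel. */
        for (size_t lane = 0; lane < NLANE && base + lane < vl; lane++) {
            a[base + lane] = b[base + lane] * c[base + lane];
        }
        passes++;
    }
    return passes;
}
```

With 10 elements and 4 lanes the sketch needs 3 passes, which is why adding lanes shortens the time per vector instruction.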

Compared to CPUs with SIMD Extensions

Intel SSE2, PowerPC Altivec, etc.
  Short, fixed-length vectors (e.g., 4)
  Single cycle per instruction
  Many data pack/unpack instructions

[Figure labels: source SIMD registers; SIMD unit; destination SIMD register]

Hybrid vector-SIMD vs Traditional Vector

[Figure panels: traditional vector processing; hybrid vector-SIMD processing]

for (i=0; i<NELEM; i++) { C[i] = A[i] + B[i]; E[i] = C[i] * D[i]; }

Vector ISA Features

Vector length (VL) register

Conditional execution
  Vector flag registers

Vector addressing modes
  Unit stride
  Constant stride
  Indexed offset
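
The three addressing modes can be sketched as plain C loops. This is a minimal illustration only: the function names, the element type, and measuring strides and offsets in elements rather than bytes are all assumptions, not the processor's instruction definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unit stride: consecutive elements, dst[i] = mem[base + i]. */
void vload_unit(int32_t *dst, const int32_t *mem, size_t base, size_t vl)
{
    assert(dst != NULL && mem != NULL);
    for (size_t i = 0; i < vl; i++)
        dst[i] = mem[base + i];
}

/* Constant stride: elements a fixed distance apart,
 * dst[i] = mem[base + i * stride]. */
void vload_strided(int32_t *dst, const int32_t *mem, size_t base,
                   size_t stride, size_t vl)
{
    assert(dst != NULL && mem != NULL);
    for (size_t i = 0; i < vl; i++)
        dst[i] = mem[base + i * stride];
}

/* Indexed offset: per-element offsets taken from an index vector,
 * dst[i] = mem[base + index[i]]. */
void vload_indexed(int32_t *dst, const int32_t *mem, size_t base,
                   const size_t *index, size_t vl)
{
    assert(dst != NULL && mem != NULL && index != NULL);
    for (size_t i = 0; i < vl; i++)
        dst[i] = mem[base + index[i]];
}
```

A strided load is the natural way to read one column of a row-major image; an indexed load covers table lookups.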

[Figure: Vector Merge Operation. Labels: source registers; flag register; destination register]
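
A vector merge under a flag register can be sketched in C as an element-wise select. The function name vmerge, and the choice that a set flag picks the first source, are assumptions made for illustration; the slide does not state the polarity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise merge: where flag[i] is nonzero take src1[i],
 * otherwise take src2[i] (polarity is an assumption). */
void vmerge(int32_t *dst, const int32_t *src1, const int32_t *src2,
            const uint8_t *flag, size_t vl)
{
    assert(dst != NULL && src1 != NULL && src2 != NULL && flag != NULL);
    for (size_t i = 0; i < vl; i++)
        dst[i] = flag[i] ? src1[i] : src2[i];
}
```

This is how an if/else inside a loop body can be executed without branches: both sides are computed as vectors and the flag register selects per element.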

Example: Simple 5x5 Median Filtering

Pseudocode (Bubble sort)

Load the 25 pixel vectors P[0..24]
For i=0 to 12 {
  minimum = P[i]
  For j=i to 24 {
    if (P[j] < minimum) {swap (minimum, P[j])}
  }
}

Slide “window” over after 1 median

Repeated over entire image
  Many windows

[Figure label: output pixel]

Example: Simple 5x5 Median Filtering


Bubble sort on vector registers
  Vector flag register to mask execution
  “VL” results at once!

25 rows -> 25 vector registers
  “VL” pixels each
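
The vectorized form of the pseudocode can be sketched in C, with each of the 25 arrays standing in for one vector register holding VL pixels. The compare sets a flag per element and the swap only takes effect where the flag is set. The names vswap_if_less and vmedian25 and the value of VL are illustrative assumptions; the real code is written in vector assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NPIX 25   /* pixels in a 5x5 window */
#define VL   8    /* assumed vector length: windows processed at once */

/* Masked swap: flag[k] marks positions where p[k] < minimum[k];
 * only flagged positions exchange their values. */
void vswap_if_less(uint8_t minimum[VL], uint8_t p[VL])
{
    uint8_t flag[VL];
    for (size_t k = 0; k < VL; k++)      /* vector compare sets the flags */
        flag[k] = (uint8_t)(p[k] < minimum[k]);
    for (size_t k = 0; k < VL; k++) {    /* swap executes under the mask */
        if (flag[k]) {
            uint8_t t = minimum[k];
            minimum[k] = p[k];
            p[k] = t;
        }
    }
}

/* P[r] plays the role of vector register r; P is modified in place.
 * After pass i = 12, minimum holds the 13th smallest value of each
 * window, which is the median of 25. */
void vmedian25(uint8_t P[NPIX][VL], uint8_t median[VL])
{
    uint8_t minimum[VL];
    assert(P != NULL && median != NULL);
    for (size_t i = 0; i <= 12; i++) {
        memcpy(minimum, P[i], VL);
        for (size_t j = i; j < NPIX; j++)
            vswap_if_less(minimum, P[j]);
    }
    memcpy(median, minimum, VL);
}
```

The control flow is identical for every window, so one pass through the loops yields VL medians at once.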

Soft Vector Processor Architecture

[Figure: block diagram of the soft vector processor. Labels:]
  Nios II core
  Shared instruction memory (scalar / vector instructions)
  Shared scalar / vector memory interface
  Overlapped scalar / vector execution
  Configurable memory width
  Configurable number of lanes
  Distributed vector register file
  One vector register (e.g., v0)
  Local vector datapath memory
  MAC chain
  Result to VLane 0

Vector Sum Reduction with MAC

Sum reduction
  R = Σ A[i] * B[i]
  R = Σ A[i] (using B[i] = 1)

Reduces VL elements in vector register to single number

Two-instruction sequence:
  vmac: multiply-accumulate to accumulators
  vcczacc: compress copy and zero accumulators

Side effect: can only reduce 18-bit inputs

[Figure label: accumulate chain]
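
The two-instruction reduction can be modeled in C as follows. This is a rough sketch under stated assumptions: the lane count NLANE, the chain length MACL, the mapping of one accumulator to each group of MACL lanes, and the rule for the shortened vector length after vcczacc are all hypothetical choices for illustration. The C functions only borrow the instruction names; they are not the instructions' exact semantics, and the 18-bit input limit is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NLANE 8               /* assumed number of lanes */
#define MACL  4               /* assumed MAC chain length, must be >= 2 here */
#define NACC  (NLANE / MACL)  /* assumed: one accumulator per chain */

static int64_t acc[NACC];     /* modeled accumulators, initially zero */

/* vmac model: multiply element pairs and add each product to the
 * accumulator of the chain that its lane belongs to. */
void vmac(const int32_t *a, const int32_t *b, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        size_t lane = i % NLANE;
        acc[lane / MACL] += (int64_t)a[i] * b[i];
    }
}

/* vcczacc model: copy the accumulators that were used into the low
 * elements of dst, zero all accumulators, return the new vector length. */
size_t vcczacc(int32_t *dst, size_t vl)
{
    size_t used = (vl + MACL - 1) / MACL;
    if (used > NACC)
        used = NACC;
    for (size_t k = 0; k < NACC; k++) {
        if (k < used)
            dst[k] = (int32_t)acc[k];
        acc[k] = 0;
    }
    return used;
}

/* Sum of a[i] * b[i]: repeat vmac / vcczacc until one element remains. */
int32_t vdot(const int32_t *a, const int32_t *b, size_t vl)
{
    int32_t tmp[NACC], ones[NACC];
    assert(a != NULL && b != NULL && vl > 0);
    for (size_t k = 0; k < NACC; k++)
        ones[k] = 1;
    vmac(a, b, vl);
    vl = vcczacc(tmp, vl);
    while (vl > 1) {          /* later passes use B[i] = 1 */
        vmac(tmp, ones, vl);
        vl = vcczacc(tmp, vl);
    }
    return tmp[0];
}
```

A plain sum reduction is the same call with b filled with ones, matching the B[i] = 1 case above.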

Configurable Parameters

Some configurable features
  Number of vector lanes
  Vector ALU width
  Vector memory access granularity (8, 16, 32b)
  Local memory size (or none)

Strongly affect performance, area

Partial List of Configurable Parameters

                                                                            Soft vector processors
Parameter     Description                                    Typical        V4     V8     V16M32

Primary Parameters
NLane         Number of vector lanes                         4-128          4      8      16
MVL           Maximum vector length                          16-512         16     32     64
VPUW          Processor data width (bits)                    8, 16, 32      32     32     32
MemMinWidth   Minimum accessible data width in memory        8, 16, 32      8      8      32

Parameters for Optional Features
MultW         Multiplier width (bits, 0 is off)              0, 8, 16, 32   16     16     16
MACL          MAC chain length (0 is no MAC)                 0, 1, 2, 4     1      2      0
LMemN         Local memory number of words                   0-1024         256    256    0
LMemShare     Shared local memory address space within lane  On/Off         Off    Off    Off
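
The parameter table can be captured as a small C structure with the three listed configurations as presets. This is only an illustration of the table: the struct, field, and function names are hypothetical, and the range check simply encodes the "Typical" column rather than any hard limit of the processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct svp_config {
    unsigned nlane;          /* NLane: number of vector lanes */
    unsigned mvl;            /* MVL: maximum vector length */
    unsigned vpuw;           /* VPUW: processor data width (bits) */
    unsigned mem_min_width;  /* MemMinWidth (bits) */
    unsigned multw;          /* MultW: multiplier width, 0 is off */
    unsigned macl;           /* MACL: MAC chain length, 0 is no MAC */
    unsigned lmem_n;         /* LMemN: local memory words */
    bool lmem_share;         /* LMemShare */
};

/* Presets copied from the V4, V8 and V16M32 columns of the table. */
const struct svp_config SVP_V4     = { 4, 16, 32,  8, 16, 1, 256, false };
const struct svp_config SVP_V8     = { 8, 32, 32,  8, 16, 2, 256, false };
const struct svp_config SVP_V16M32 = {16, 64, 32, 32, 16, 0,   0, false };

static bool is_8_16_32(unsigned w)
{
    return w == 8 || w == 16 || w == 32;
}

/* True if every field lies within the "Typical" column of the table. */
bool svp_config_in_typical_range(const struct svp_config *c)
{
    assert(c != NULL);
    return c->nlane >= 4 && c->nlane <= 128
        && c->mvl >= 16 && c->mvl <= 512
        && is_8_16_32(c->vpuw)
        && is_8_16_32(c->mem_min_width)
        && (c->multw == 0 || is_8_16_32(c->multw))
        && (c->macl == 0 || c->macl == 1 || c->macl == 2 || c->macl == 4)
        && c->lmem_n <= 1024;
}
```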

Performance Results

Benchmarking

3 sample application kernels
  5x5 median filter
  Motion estimation (full search block matching)
  128-bit AES encryption (MiBench)

C code, 3 versions
  Nios II
  Nios II with inline vector assembly
  Nios II with C2H accelerator

Methodology and Assumptions

Compile C code with nios2-gcc

Run time = Instructions * cycles-per-instruction / Fmax

Nios II
  Instruction: 1 cycle
  Memory load: 1 cycle

Nios II with vectors
  Vector instruction: (VL / NLane) cycles
  Vector load: 2 * (VL / NLane) + 2 cycles
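
The cycle model above is simple enough to write down as C helpers. The function names are hypothetical; rounding VL / NLane up when VL is not a multiple of NLane, and treating Fmax as a value in MHz, are assumptions that the slide does not state.

```c
#include <assert.h>

/* Lane passes for vl elements. Rounding up for a vl that is not a
 * multiple of nlane is an assumption; the slide only gives VL / NLane. */
unsigned lane_passes(unsigned vl, unsigned nlane)
{
    assert(nlane > 0);
    return (vl + nlane - 1) / nlane;
}

/* Vector instruction: (VL / NLane) cycles. */
unsigned vector_instr_cycles(unsigned vl, unsigned nlane)
{
    return lane_passes(vl, nlane);
}

/* Vector load: 2 * (VL / NLane) + 2 cycles. */
unsigned vector_load_cycles(unsigned vl, unsigned nlane)
{
    return 2 * lane_passes(vl, nlane) + 2;
}

/* Run time in microseconds = cycles / Fmax, with Fmax assumed in MHz. */
double run_time_us(unsigned long cycles, double fmax_mhz)
{
    assert(fmax_mhz > 0.0);
    return (double)cycles / fmax_mhz;
}
```

For example, with VL = 64 and NLane = 16 a vector instruction takes 4 cycles and a vector load takes 10 cycles under this model.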

Altera C2H Compiler

Nios II with C2H accelerator
  Synthesizes HW accelerator from a C function
  C memory reference = master port to that memory
  Current limitations:
    No automatic loop unrolling
    Up to user to efficiently partition memory

[Figure: system diagram. Labels: Nios II; C2H accelerators; master ports; arbiters; Avalon fabric; memories]

C2H Methodology

Compile application kernels with C2H compiler
  Automatic pipelining and scheduling
  Manually unroll loops
  Manually “vectorize” C code

Nios II with C2H accelerator
  C2H compiler reports # of clock cycles
  Includes memory arbitration overhead

C2H Example

AES encryption round
  Shift 4 32-bit words (by different amounts)
  4 table lookups
  XOR results, XOR with key

Acceleration steps
  1. Process multiple blocks in parallel (increase array sizes)
  2. Manually create 4 on-chip memories for 4 lookup tables

[Figure label: 32-bit word]

Vector soft processor design flow

[Figure: flowchart. Boxes:]
  Design vector algorithm
  Develop C code, vector assembly
  Compile source code, assemble with vector assembler
  Result meets requirements?
  Synthesize system, place and route
  Configure soft processor parameters
  Download application to processor
  Determine FPGA resource budget

Hardware accelerator design flow

[Figure: flowchart. Boxes:]
  Develop C code for Nios II
  Identify areas for HW acceleration
  Isolate sections to accelerate into C functions
  Analyze compilation estimates
  Result meets requirements?
  Tune system architecture
  Apply optimizations to C source code
  Run C2H compiler
  Synthesize system, place and route

[Figure annotations: Hardware-aware transformations; Software-only optimizations; Programmers know how to do this!]

Resource Utilization

[Figure: bar chart of normalized resource usage (to smallest Stratix III) for C2H AES, V4, V8, V16; series: ALM, M9K, DSP Elements]

Biggest Stratix III = 7x more resources

Note: These Vector processors include a large local memory in each vector lane (an optional feature), hence the high M9K utilization. Removal would save 60% of M9K in V16.

Resource Utilization Estimates

                            ALM      DSP Elements   M9K    Fmax
Smallest Stratix III        19000    216            108    -
Nios II/s                   489      8              4      153
  + C2H Median filtering    825      8              4*     147
  + C2H Motion estimation   977      10             4*     135
  + C2H AES encryption      2480     8              6*     119
UTIIe                       324      0              3      193
  + V4                      5215     21             32     115
  + V8                      7011     34             53     114
  + V16                     10266    58             95     113

* C2H results are obtained from compiling to Stratix II; uses M4K memories

Results: Clock Cycles

[Figure: bar chart of normalized clock cycles (to ideal Nios II), axis from 0 to 0.9, for Nios II, C2H, V4, V8, V16; series: median filtering, motion estimation, AES encryption]

Speedup vs Resource Utilization Summary

[Figure: speedup against normalized area (number of ALMs), area axis from 0 to 30; labels: Nios II/s, Vector; series: median filtering, AES encryption, motion estimation]

Summary of Effort

C2H accelerators
  1. “Vectorize” code for C2H: 1 day
  2. Extra-effort optimization: 1 day
  3. Place-and-route waiting: 1 hour
  Each iteration = 1 day + P&R

Vector soft processor
  1. Vector algorithm, write vector assembly: 2 days
  2. Revise vector algorithm: 0.5 day
  Each iteration = 0.5 day + SW compile only

Lessons from Vector Processor Design

Register files
  2-read, 1-write memory very common for CPUs
  Multiple write ports for wide-issue processing

Wide, flexible vector memory interface very costly
  Memory crossbars: several multi-bit multiplexers
  ~1/3 the resources of soft vector processor (128b, byte access)

Stratix III specific
  DSP shift chain can no longer dynamically select input
  MAC chain is useful
    Would like 32-bit MAC chain

Current Progress

Development toolchain integration
  Packaged as SOPC Builder component
  No built-in debug core
    Uses real Nios II processor to download code on to system
  Inline vector assembly in Nios II IDE

Future work
  Compiler
  Floating-point

Conclusion

Vector processing maps well to FPGA
  Many small memories, DSP blocks
  Simple programming model

Soft vector processor
  Purely software-based acceleration
    No hardware design / RTL recompile needed, just program
    One hardware accelerator supports many applications
  Scalable performance and area
    More vector lanes -> more performance for more area
    Soft core parameters/features -> area customization
