Automatic generation of platform architectures using OpenCL: FPGA roadmap. Department of Electrical and Computer Engineering, University of Thessaly, Volos, Greece. Nikolaos Bellas

Automatic generation of platform architectures using open cl and fpga roadmap

May 10, 2015


Manolis Vavalis
Transcript
Page 1: Automatic generation of platform architectures using open cl and fpga roadmap

Automatic generation of platform architectures using OpenCL

FPGA roadmap

Department of Electrical and Computer Engineering, University of Thessaly

Volos, Greece

Nikolaos Bellas

Page 2: Automatic generation of platform architectures using open cl and fpga roadmap

What is an FPGA?
• Field Programmable Gate Array (FPGA) is the best known example of Reconfigurable Logic
• Hardware can be modified post chip fabrication
• Tailor the Hardware to the application
– Fixed logic processors (CPUs/GPUs) only modify their software (via programming)
• FPGAs can offer superior performance, performance/power, or performance/cost compared to CPUs and GPUs.


Page 3: Automatic generation of platform architectures using open cl and fpga roadmap

FPGA architecture

• A generic island-style FPGA fabric
• Configurable Logic Blocks (CLB) and Programmable Switch Matrices (PSM)
• Bitstream configures functionality of each CLB and interconnection between logic blocks


Page 4: Automatic generation of platform architectures using open cl and fpga roadmap

The Xilinx Slice
• Xilinx slice features
– LUTs
– MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram)
– Carry Logic
– MULT_ANDs
– Sequential Elements
• Detailed Structure

Page 5: Automatic generation of platform architectures using open cl and fpga roadmap

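Example 2-input LUT

• Lookup table: the configuration input sets the out value stored for each (a, b) combination

a b | out (configuration 0 0 0 1) | out (configuration 1 0 0 1)
0 0 | 0 | 1
0 1 | 0 | 0
1 0 | 0 | 0
1 1 | 1 | 1

• Configuration 0 0 0 1 implements AND; configuration 1 0 0 1 implements XNOR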
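The lookup behaviour of a 2-input LUT can be sketched in C as follows. This is only an illustrative software model, not how any vendor tool represents a LUT: the names `lut2_config` and `lut2_eval` are hypothetical, and the choice to store the output for row (a, b) in bit `(a << 1) | b` of the configuration word is an assumption made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the four outputs, given in truth-table row order
 * (a b = 00, 01, 10, 11), into a 4-bit configuration word.
 * Assumption for this sketch: bit r holds the output of row r. */
static uint8_t lut2_config(unsigned out00, unsigned out01,
                           unsigned out10, unsigned out11) {
    assert(out00 <= 1 && out01 <= 1 && out10 <= 1 && out11 <= 1);
    return (uint8_t)(out00 | (out01 << 1) | (out10 << 2) | (out11 << 3));
}

/* Evaluate the LUT: the inputs only select which stored bit is read. */
static unsigned lut2_eval(uint8_t config, unsigned a, unsigned b) {
    assert(a <= 1 && b <= 1);
    unsigned row = (a << 1) | b;   /* row index in the truth table */
    return (config >> row) & 1u;
}
```

With this model, `lut2_config(0, 0, 0, 1)` behaves as an AND gate and `lut2_config(1, 0, 0, 1)` as an XNOR gate; changing the function means changing only the configuration word, never the evaluation logic.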

Page 6: Automatic generation of platform architectures using open cl and fpga roadmap

Modern FPGA architecture: Xilinx Virtex family
• Columns of on-chip SRAMs, hard IP cores (PPC 405), and DSP slices (Multiply-Accumulate units)

Page 7: Automatic generation of platform architectures using open cl and fpga roadmap

FPGA discussion

• Advantages
– Potential for (near) optimal performance for a given application
– Various forms of parallelism can be exploited
• Disadvantages
– Programmable mainly at the hardware level using Hardware Description Languages (BUT, this can change)
– Lower clock frequency (200-300 MHz) compared to CPUs (~3 GHz) and GPUs (~1.5 GHz)


Page 8: Automatic generation of platform architectures using open cl and fpga roadmap

MATENVMED

Silicon OpenCL: Automatic generation of platform architectures using OpenCL

Page 9: Automatic generation of platform architectures using open cl and fpga roadmap

18-19/7/2013 MATENVMED Plenary Meeting

Introduction

• Automatic generation of hardware has been at the research forefront for the last 10 years.
• Variety of High Level Programming Models: C/C++, C-like Languages, MATLAB
• Obstacles:
– Parallelism Extraction for larger applications
– Extensive Compiler Transformations & Optimizations
• Parallel Programming Models to the Rescue:
– CUDA, OpenCL.

Page 10: Automatic generation of platform architectures using open cl and fpga roadmap


Motivation

• Parallel programming models are a natural fit for reconfigurable platforms.

• The computing industry is making a major shift toward many-core computing systems.

• Reconfigurable fabrics bear a strong resemblance to many-core systems.

Page 11: Automatic generation of platform architectures using open cl and fpga roadmap


Vision
• Provide the tools and methodology to enable the large pool of software developers and domain experts, who do not necessarily have expertise in hardware design, to architect whole accelerator-based systems
– Borrowed from advances in massively parallel programming models

[Figure: a CPU, a GPU and an FPGA connected over PCI Express]

Page 12: Automatic generation of platform architectures using open cl and fpga roadmap


Silicon OpenCL

• Silicon-OpenCL “SOpenCL”.

• A tool flow to convert an unmodified OpenCL application into a SoC design with HW/SW components.

[Figure: SOpenCL tool flow. The OpenCL Kernel enters the OpenCL-to-C Frontend, which emits a C Function for the Architectural Synthesis Backend. Other blocks shown: Simulation & Verification, Drivers & Runtime, and a System on Chip (SoC) with an On-Chip CPU, two HW Accelerators, an On-chip Bus and Off-Chip Memory.]

Page 13: Automatic generation of platform architectures using open cl and fpga roadmap


Contribution

• Architectural Synthesis methodology:
– Code Transformations.
– Architectural Template.

[Figure: the OpenCL Kernel passes through the OpenCL-to-C Frontend to the Architectural Synthesis Backend. Inside the backend, the C Kernel goes through LLVM Compilation to Optimized LLVM-IR, followed by the Transformations and Hardware Generation stages (Bitwidth Optimization, Predication, Code Slicing, Instruction Clustering, Scheduling, Verilog Generation). Outputs: Synthesizable Verilog and a Simulation Testbench, then Synthesis, P&R and the FPGA Bitstream. Inputs: Accelerator Template and User Performance Requirements. The accelerator template consists of a Streaming Unit and a Datapath, with Input Data and Output Data.]

Page 14: Automatic generation of platform architectures using open cl and fpga roadmap


OpenCL for Heterogeneous Systems
• OpenCL (Open Computing Language): a unified programming model that aims to let a programmer write a portable program once and deploy it on any heterogeneous system with CPUs and GPUs.
• Became an important industry standard after release due to substantial industry support.

Page 15: Automatic generation of platform architectures using open cl and fpga roadmap


OpenCL Platform Model

• One host and one or more Compute Devices (CD)
• Each CD consists of one or more Compute Units (CU)
• Each CU is further divided into one or more Processing Elements (PE)

[Figure: the host runs the Main Program; the compute devices run the Computations (Kernels)]

Page 16: Automatic generation of platform architectures using open cl and fpga roadmap


OpenCL Kernel Execution Geometry
• OpenCL defines a geometric partitioning of a grid of computations
• The grid consists of an N-dimensional space of work-groups
• Each work-group consists of an N-dimensional space of work-items

[Figure: a grid divided into work-groups, each divided into work-items]
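The grid / work-group / work-item partitioning can be made concrete with a little index arithmetic. The sketch below is plain C, not OpenCL host code; the helper names `global_id` and `num_groups` are hypothetical, and it assumes a zero global offset and a global size that is a multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Position of a work-item in the whole grid along one dimension,
 * derived from its work-group index and its index inside the group.
 * Assumes a zero global offset. */
static size_t global_id(size_t group_id, size_t local_size, size_t local_id) {
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Number of work-groups along one dimension.
 * Assumes the global size is a multiple of the work-group size. */
static size_t num_groups(size_t global_size, size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, a 1-D grid of 1024 work-items split into work-groups of 64 has 16 work-groups, and the work-item with local index 5 in work-group 2 has global index 133. The same arithmetic applies independently to each of the N dimensions.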

Page 17: Automatic generation of platform architectures using open cl and fpga roadmap


OpenCL Simple Example

• OpenCL kernel describes the computation of a work-item
• Finest parallelism granularity
• e.g. add two integer vectors (N=1)

C code:

/* n is the vector length; sizeof(a) on a pointer would give the pointer size, not the element count */
void add(int* a, int* b, int* c, int n) {
    for (int idx = 0; idx < n; idx++)
        c[idx] = a[idx] + b[idx];
}

OpenCL kernel code:

__kernel void vadd(__global int* a,
                   __global int* b,
                   __global int* c) {
    /* Run-time call, used to differentiate execution for each work-item */
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}

Page 18: Automatic generation of platform architectures using open cl and fpga roadmap


Why OpenCL as an HDL?

• OpenCL exposes parallelism at the finest granularity
– Allows easy hardware generation at different levels of granularity
– One accelerator per work-item, one accelerator per work-group, one accelerator per multiple work-groups, etc.
• OpenCL exposes data communication
– Critical to transfer and stage data across platforms
• We target unmodified OpenCL to open up hardware design to software engineers
– No need for hardware/architectural expertise

Page 19: Automatic generation of platform architectures using open cl and fpga roadmap


SOpenCL Tool Flow

[Figure: SOpenCL tool flow. The OpenCL Kernel enters the OpenCL-to-C Frontend, which emits a C Function for the Architectural Synthesis Backend. Other blocks shown: Simulation & Verification, Drivers & Runtime, and a System on Chip (SoC) with an On-Chip CPU, two HW Accelerators, an On-chip Bus and Off-Chip Memory.]

Page 20: Automatic generation of platform architectures using open cl and fpga roadmap

Granularity Management

• Optimal thread granularity depends on the hardware platform (CPU, GPU, FPGA)
• We select a hardware accelerator to process one work-group per invocation
– Smaller invocation overhead

Page 21: Automatic generation of platform architectures using open cl and fpga roadmap


Granularity Coarsening

[Figure: the OpenCL-to-C Frontend converts the OpenCL Kernel (work-item thread) into a C Function (work-group thread)]

Page 22: Automatic generation of platform architectures using open cl and fpga roadmap


Serialization of Work Items

OpenCL code:

__kernel void vadd(…) {
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}

C code:

__kernel void Vadd(…) {
    int idx;
    int i, j, k;  /* loop counters over the three work-group dimensions */
    for (i = 0; i < get_local_size(2); i++)
        for (j = 0; j < get_local_size(1); j++)
            for (k = 0; k < get_local_size(0); k++) {
                idx = get_item_gid(0);
                c[idx] = a[idx] + b[idx];
            }
}

Index computation:

idx = (global_id2(0) + i) * Grid_Width * Grid_Height + (global_id1(0) + j) * Grid_Width + (global_id0(0) + k);
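A self-contained C sketch of this serialization is given below. It is an illustration under stated assumptions, not the code the SOpenCL frontend emits: the `work_group_t` descriptor, `linear_index` and `vadd_work_group` are hypothetical names, and the `base` array stands in for the per-dimension global id of the first work-item in the group (the `global_id2(0)`, `global_id1(0)`, `global_id0(0)` terms of the slide's index formula).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor of one work-group inside a 3-D grid. */
typedef struct {
    size_t base[3];        /* global id of the group's first work-item, per dimension */
    size_t local_size[3];  /* work-group extent, per dimension */
    size_t grid_width;     /* global extent of dimension 0 */
    size_t grid_height;    /* global extent of dimension 1 */
} work_group_t;

/* Linearized global index of work-item (i, j, k) within the group,
 * following the index formula on the slide. */
static size_t linear_index(const work_group_t *wg, size_t i, size_t j, size_t k) {
    return (wg->base[2] + i) * wg->grid_width * wg->grid_height
         + (wg->base[1] + j) * wg->grid_width
         + (wg->base[0] + k);
}

/* One call processes a whole work-group: the work-items that OpenCL
 * runs as separate threads become iterations of a triple loop. */
static void vadd_work_group(const int *a, const int *b, int *c,
                            const work_group_t *wg) {
    assert(wg->grid_width > 0 && wg->grid_height > 0);
    for (size_t i = 0; i < wg->local_size[2]; i++)
        for (size_t j = 0; j < wg->local_size[1]; j++)
            for (size_t k = 0; k < wg->local_size[0]; k++) {
                size_t idx = linear_index(wg, i, j, k);
                c[idx] = a[idx] + b[idx];
            }
}
```

Calling `vadd_work_group` once per work-group covers the whole grid. The loop nest is what makes one accelerator invocation per work-group possible, which is the granularity chosen on the Granularity Management slide.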

Page 23: Automatic generation of platform architectures using open cl and fpga roadmap


Architectural Synthesis
• Exploit available parallelism and application specific features.
• Apply a series of transformations to generate customized hardware accelerators.
• Uses LLVM Compiler Infrastructure.
• Generate synthesizable Verilog & Test bench.

[Figure: architectural synthesis flow. The C Kernel goes through LLVM Compilation to Optimized LLVM-IR, followed by the Transformations and Hardware Generation stages (Bitwidth Optimization, Predication, Code Slicing, Instruction Clustering, Scheduling, Verilog Generation). Outputs: Synthesizable Verilog and a Simulation Testbench, then Synthesis, P&R and the FPGA Bitstream. Inputs: Accelerator Template and User Performance Requirements.]

Page 24: Automatic generation of platform architectures using open cl and fpga roadmap

04/12/23 MATENVMED Kick Off Meeting

Verilog Generation: PE Architecture

• Predication, Code slicing, SMS mod scheduling, Verilog generation
• FU types, Bitwidths, I/O Bandwidth
• Feed Data in Order, Write Data in Order

[Figure: PE architecture. The Streaming Unit (Arbiter, Sin Align Unit, Sout Align Unit, Sin Requests Generator, Cache Unit, Sin AGU, Sout AGU) connects to the System Interconnect through Address, Data_in, Data_out, Data_line and Local request signals. It exchanges the Sin0, Sin1 and Sout0 streams and a Terminate signal with the Data Path, which is built from FUs, Multiplexers, Named Registers, Memory Mapped Registers and a Tunnel.]

Sin Kernel:
ind = phi [0, preh], [i2, body]
i0 = add a0, ind
i2 = add ind, 1
i3 = add a0, i2
gep0 = getelementptr i8* x0, i0
gep1 = getelementptr i8* x0, i3
i7 = load i8* gep0
i10 = load i8* gep1

Computational Kernel:
i46 = phi [true, preh], [i41, body]
ind = phi [0, preh], [i2, body]
i2 = add ind, 1
i7 = pop i8* gep0
i9 = mul i7, a3
i10 = pop i8* gep1
i12 = mul i10, a4
i19 = add i9, 32
i20 = add i19, i12
i23 = ashr i22, 6
push i23, i8* gep4
i40 = icmpeq i2, 8
i41 = xor i40, true
br i40, exit, body

Sout Kernel:
ind = phi [0, preh], [i2, body]
i2 = add ind, 1
i6 = add a2, ind
gep4 = getelementptr i8* x1, i6
store i23, i8* gep4
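The split into Sin, Computational and Sout kernels on this slide can be mimicked in software. The C sketch below is a rough model under several assumptions, all of them mine rather than the slide's: the FIFO type and function names are hypothetical; the value `i22`, which the slide does not define, is taken to equal `i20`; and the three stages run one after another, whereas the hardware units would run concurrently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRIP_COUNT 8    /* loop bound taken from "icmpeq i2, 8" */
#define FIFO_DEPTH 16   /* hypothetical depth, large enough for staged execution */

typedef struct { int8_t data[FIFO_DEPTH]; size_t head, tail; } fifo_t;

static void fifo_push(fifo_t *f, int8_t v) {
    assert(f->tail - f->head < FIFO_DEPTH);   /* not full */
    f->data[f->tail++ % FIFO_DEPTH] = v;
}

static int8_t fifo_pop(fifo_t *f) {
    assert(f->head < f->tail);                /* not empty */
    return f->data[f->head++ % FIFO_DEPTH];
}

/* Sin kernel: address generation and loads only; feeds data in order. */
static void sin_kernel(const int8_t *x0, int a0, fifo_t *sin0, fifo_t *sin1) {
    for (int ind = 0; ind < TRIP_COUNT; ind++) {
        fifo_push(sin0, x0[a0 + ind]);
        fifo_push(sin1, x0[a0 + ind + 1]);
    }
}

/* Computational kernel: no addresses, only pops, arithmetic and pushes. */
static void compute_kernel(int a3, int a4, fifo_t *sin0, fifo_t *sin1, fifo_t *sout0) {
    for (int ind = 0; ind < TRIP_COUNT; ind++) {
        int v0 = fifo_pop(sin0);
        int v1 = fifo_pop(sin1);
        int acc = v0 * a3 + 32 + v1 * a4;     /* i20; assumed equal to i22 */
        /* ">>" models ashr; for negative acc the C result is implementation-defined */
        fifo_push(sout0, (int8_t)(acc >> 6));
    }
}

/* Sout kernel: address generation and stores only; writes data in order. */
static void sout_kernel(int8_t *x1, int a2, fifo_t *sout0) {
    for (int ind = 0; ind < TRIP_COUNT; ind++)
        x1[a2 + ind] = fifo_pop(sout0);
}

/* Run the three slices back to back. */
static void run_sliced(const int8_t *x0, int8_t *x1, int a0, int a2, int a3, int a4) {
    fifo_t sin0 = {{0}, 0, 0}, sin1 = {{0}, 0, 0}, sout0 = {{0}, 0, 0};
    sin_kernel(x0, a0, &sin0, &sin1);
    compute_kernel(a3, a4, &sin0, &sin1, &sout0);
    sout_kernel(x1, a2, &sout0);
}
```

The point of the separation is that memory addressing lives entirely in the Sin and Sout slices, which would map onto the Streaming Unit, while the computational slice sees only ordered streams and would map onto the Data Path.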

Page 25: Automatic generation of platform architectures using open cl and fpga roadmap

Roadmap for FPGA implementation

Page 26: Automatic generation of platform architectures using open cl and fpga roadmap

FPGA Implementation

• Our plan is to use the same code base (e.g. OpenCL) to explore different architectures
– OpenCL used for multicore CPU, GPU, FPGA (SOpenCL)

• Fast exploration based on area, performance and power requirements


Page 27: Automatic generation of platform architectures using open cl and fpga roadmap

FPGA Implementation
• Monte-Carlo simulations can exploit multi-level parallelism of FPGAs
– Multiple MC simulations per point
– Multiple points simultaneously
– Double precision Trigonometric, Log, Addition and Multiplication functions for each walk
– FP operations with double precision are not an FPGA's strong point, but SOpenCL can still handle them.
